So one of my goals for the new year is to better document and present my side projects. I have a habit of playing around with an idea, getting a very bare-bones prototype, and then letting it sit and languish. So to kick things off, I present to you Surf Check.
One of the first Python programs I ever wrote was a surf-scraping program that pulled surf forecast data from Swellinfo.com, a surf forecast website, for 6 different locations and printed out a condensed surf report for all of the locations on a single page. Besides getting experience with Python, the main reason I wrote the code was that it was a pain to flip through a handful of surf forecast pages on the website. The website is loaded with graphics and ads, and it is not easily navigable, so a quick check of the surf forecast would end up taking over 5 minutes, accounting for page loads and navigation. I figured building a scraper to grab the data I wanted and condense it down to one page was a perfect first project.
I wrote the initial scraper around March 2013, when I was just getting started with Python. Over time I tinkered with the program, and eventually decided to rewrite it and turn it into a small web page to make it easier to access. So I added a Flask server, rewrote the scraper code, and set up a simple frontend with a Jinja template serving basic HTML.
Comparing the before and after, I was able to make some pretty big improvements to the original code. The original scraper was over 160 lines, and the new project is ~140 lines, including the Flask server, HTML template, and scraper. Of course the comparison is thrown off by the fact that, for some reason, when I wrote the original program I couldn't get Beautiful Soup (a.k.a. bs4, a Python HTML parser) to work. My guess is it was due to my unfamiliarity with object-oriented programming and Python in general, but I did a weird workaround where I saved the bs4 output to a text file, imported the text file, and then parsed the text to get what I needed. Ahhh, yes, the things we do when we are inexperienced! Makes me cringe now, but it is a good lesson. Had I gotten bs4 to work the first time around, I am pretty sure the scraper code would have been pretty similar to my final version.
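For anyone unfamiliar with bs4, here is a rough sketch of the kind of direct parsing I was missing out on back then. The HTML snippet and the class names in it are made up for illustration; they are not Swellinfo's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical forecast-style HTML, just for the example --
# not the real structure of a Swellinfo page.
SAMPLE_HTML = """
<div class="forecast">
  <span class="spot">Ocean City</span>
  <span class="cond">2-3 ft, fair</span>
</div>
"""

def parse_spot(html):
    """Pull the spot name and conditions straight out of the parsed tree,
    no text-file roundtrip required."""
    soup = BeautifulSoup(html, "html.parser")
    spot = soup.find("span", class_="spot").get_text()
    cond = soup.find("span", class_="cond").get_text()
    return spot, cond

print(parse_spot(SAMPLE_HTML))
```

Compare that to writing the parser's output to disk and re-parsing it by hand, and you can see why the rewrite came out so much shorter.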
A quick note on the code for the project. Below is the bulk of the code that makes up the views.py file.
app = Flask(__name__)

# Keep track of the time between scrapes to prevent unnecessary requests
LAST_SCRAPE = datetime.datetime.now()

# Get an initial set of scraped data
SPOT_CONDITIONS = run_scraper()
print(LAST_SCRAPE, SPOT_CONDITIONS)

# Interval between scrapes
DELTA = datetime.timedelta(hours=4)


def get_cond_data():
    """
    Returns a dictionary of spot conditions.

    Uses a global to save the forecast data between requests. If the app
    has just been initialized, it will run the scraper; otherwise, it will
    re-run the scraper if the last scrape is over 4 hours old.
    """
    global SPOT_CONDITIONS, LAST_SCRAPE
    now = datetime.datetime.now()
    if now - LAST_SCRAPE > DELTA:
        SPOT_CONDITIONS = run_scraper()
        LAST_SCRAPE = now
    return SPOT_CONDITIONS


@app.route('/')
def surfs_up():
    """Returns surf forecast."""
    spot_conditions = get_cond_data()
    return render_template('main.html',
                           last_update=str(LAST_SCRAPE),
                           spots=spot_conditions)
Scraping the forecast data from the surf website takes a while, ~9 seconds for all six pages. If I were expecting a lot of traffic, I would set up a scheduler to automatically scrape the site in the background so the data would be available immediately when a request hit the server. However, as this is just a small project that I am doing for fun, I don't want to hit Swellinfo's servers with excessive scrape requests, so I decided to scrape only when a request is made to my site. The obvious downside is a really long load time for the page, since it has to wait for the scrape to finish before it can serve the data. To mitigate this issue slightly, and to further limit requests to Swellinfo's servers, I store the forecast data for a period of time (surf forecasts typically only get updated every 12 hours or so). At the moment, I have that period set to 4 hours: if the scraped data is over four hours old when a request hits my homepage, the app re-scrapes the data, and every homepage request in the next 4 hours gets served the saved data.

Additionally, to keep things simple, I chose to forgo any persistent data storage, so the scraped data gets stored in a global variable (SPOT_CONDITIONS). While using global variables in Python is generally frowned upon, I thought it was an interesting way to change up Flask's typical MVC (Model-View-Controller) pattern. Essentially, I have just reduced it down to VC.
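For what it's worth, the background-scheduler alternative I mentioned above could be sketched with nothing but the standard library. This is not code from the project; `run_scraper` is stubbed out here with placeholder data so the sketch stands on its own.

```python
import threading

def run_scraper():
    # Stand-in for the project's real scraper function; returns fake data
    # so this sketch runs without hitting any website.
    return {"Ocean City": "2-3 ft, fair"}

SPOT_CONDITIONS = run_scraper()
REFRESH_SECONDS = 4 * 60 * 60  # match the 4-hour cutoff from the post

def refresh_forecast():
    """Re-scrape in the background and schedule the next refresh."""
    global SPOT_CONDITIONS
    SPOT_CONDITIONS = run_scraper()
    timer = threading.Timer(REFRESH_SECONDS, refresh_forecast)
    timer.daemon = True  # don't keep the process alive just for the timer
    timer.start()
```

With something like this running, requests would always be served the cached data instantly, at the cost of scraping on a fixed clock whether or not anyone visits, which is exactly the tradeoff that made me skip it.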
I thought that code snippet was fun because, despite its apparent simplicity, it hides some complex design decisions. In the future, it might make sense to implement a more robust scraping mechanism, perhaps by figuring out the exact times that the Swellinfo surf forecasts get updated, and then only re-scraping if the data is stale (instead of arbitrarily using the 4-hour cutoff). I have a few ideas for improvements and features I would like to add to the site, but I also have some more ambitious projects on my plate that are hogging my attention, so we'll see if I get around to it. If you want to check out either the old scraper code (my first Python program) or this current iteration of the project, the links are here: My First Python Program!!! Surf Check Github Repo!!!!