| First Posted 09 July 2019 | Last Edited 09 July 2019 | Philip Kiely |
My mother wrote a blog, Overneath it All, for many years. Goodreads has an archive as far back as 2015, but she had been writing well before that, starting in 2011. I didn’t read much of it, but I was featured in it regularly. She never referenced me by name, instead using my age at the time (“Fifteen did this” or “Eighteen did that”). In January 2019, having not written for it in six months, she decided to shut the blog down, but she wanted an editable archive in case she ever needed to reuse anything from it. At her request, I scraped the entire blog into a giant Microsoft Word file.
Web scraping is the sort of low-hanging fruit in automation whose usefulness is outsized compared to its difficulty. Getting the code to work just so isn’t always straightforward, but the impact of the end product usually ranges from “saves hours of repetitive, error-prone manual copying” to “makes a previously impossible thing happen overnight.” This project definitely falls under the former category: the few hundred articles would have been possible but frustrating to manually copy and format into a complete corpus. Fortunately, on account of not being a medieval scribe, I was able to use Python and BeautifulSoup to archive the blog.
You can see the code here, but it is very messy and specific. In general, writing good, clean code for a web scraper is very difficult because of the specificity of the data source and the likelihood of edge cases. In this case, I didn’t need clean code because I was certainly only going to use it once (after all, the website no longer exists), so I boiled a pot of spaghetti and threw it at the site. There were a variety of edge cases: posts with and without various metadata, images of various sizes, formats, and placements, and even one post from 2016 that somehow was its own page, like an “About” page would be: totally separate from the post feed. Through trial and error, I gathered the contents of the blog into the sort of Microsoft Word document that heats up your computer to 90 degrees Celsius just by opening it.
In general, web scraping takes three steps: generate the links, scrape the pages, save the data. In this project, I generated the links by making the entire range of month-year archive links for the history of the blog. Thus, I had some pages with no posts, some with one post, and others with multiple posts. To deal with that, the second step involved a complex function (by which I mean 60 lines of if-else and try-except) that grabbed all of the text and images, plus a similar function for the aforementioned extra page. For the final step, I stored each post in a custom object, then iterated over an array of those objects, writing formatted text to a Microsoft Word file using the
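As a rough sketch of the first step, generating the month-year archive links might look something like the following. This is not the code from the project: the base URL, the `/YYYY/MM/` URL pattern, the `month_archive_urls` helper, and the January 2011 start month are all assumptions for illustration, since the blog’s real address and archive layout aren’t given here.

```python
from datetime import date

# Hypothetical base URL; the blog's real address is not given in this post.
BASE_URL = "https://example.com"


def month_archive_urls(start: date, end: date, base_url: str = BASE_URL) -> list:
    """Return one archive URL per month from start to end, inclusive."""
    urls = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        # Assumed archive pattern: /YYYY/MM/ (common on hosted blogs).
        urls.append(f"{base_url}/{year}/{month:02d}/")
        # Advance one month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return urls


# The blog ran from 2011 to January 2019; the exact first month is a guess.
urls = month_archive_urls(date(2011, 1, 1), date(2019, 1, 1))
print(len(urls), "archive pages to request")
print(urls[0], "...", urls[-1])
```

Generating every month in the range, rather than discovering which months actually have posts, keeps this step trivial at the cost of requesting some empty pages, which is why the scraping step then has to cope with pages holding zero, one, or several posts.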
Web scraping can be frustrating and specific, but it is a powerful tool in a programmer’s arsenal. It’s helpful, but not necessary, to have done a fair bit of web programming yourself, as it reduces the number of concepts that you need to learn. If you want to practice web scraping by following along through a project, check out my article on FloydHub, which covers scraping Hacker News “Who is Hiring” threads.