The Wizarding Network of Harry Potter
November 05, 2014
While that code is still being integrated into the CoS infrastructure, I wanted to write a post applying and extending my code to other datasets. After some thought, I figured creating a network of Harry Potter character connections would be pretty neat. I did find one existing dataset, but unfortunately I was looking for one that contained more than 65 characters. Looks like a job for web scraping!
First, I scraped a list of 178 characters from Wikipedia. These characters also have their own pages on a Harry Potter wiki site, and the majority of the names were identical to the text that needed to be appended to the URL to access each page. A quick programmatic check revealed only a few mismatches to correct, so luckily this step required little manual labor.
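To make the URL check concrete, here is a minimal sketch in Python (not the rvest code behind this post). The base URL, the underscore convention for page names, and the `page_exists` callback are all assumptions for illustration; a real run would swap the stand-in lookup for an actual HTTP request.

```python
from urllib.parse import quote

# Hypothetical base URL; the wiki's real address is not given in this post.
BASE_URL = "https://example.org/wiki/"


def name_to_url(name, base_url=BASE_URL):
    """Guess a page URL by replacing spaces with underscores (assumed convention)."""
    slug = quote(name.strip().replace(" ", "_"))
    return base_url + slug


def find_missing(names, page_exists):
    """Return the names whose guessed URL does not resolve, according to page_exists."""
    return [n for n in names if not page_exists(name_to_url(n))]


# Stand-in for a real HTTP check so the sketch runs offline.
known_pages = {BASE_URL + "Harry_Potter", BASE_URL + "Hermione_Granger"}
names = ["Harry Potter", "Hermione Granger", "Tom Riddle"]

missing = find_missing(names, lambda url: url in known_pages)
print(missing)  # ['Tom Riddle']
```

Names that come back in `missing` are the ones that would need a manual fix before scraping.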
Next, I made the assumption that on a given character’s page, any name hyperlinked in the main text shares some “logical connection” with that character in the Harry Potter universe. Of course, this assumption may not be completely tenable, but the results it produced seemed reasonable. Thus, each character page was scraped for an image to use on a custom slider I created, as well as a list of connections to the other characters on my list.
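The hyperlink assumption can be sketched roughly as follows, again in Python with only the standard library rather than rvest. Matching on the visible link text, the sample HTML, and the three-name character list are all made up for illustration; the actual scrape may well have matched on link targets or restricted itself to a specific page region.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the visible text of every <a> element in a page."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.current = []
        self.link_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.current = []

    def handle_data(self, data):
        if self.in_link:
            self.current.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.link_texts.append("".join(self.current).strip())
            self.in_link = False


def connections(page_html, page_owner, characters):
    """Return the listed characters that are hyperlinked on page_owner's page."""
    parser = LinkCollector()
    parser.feed(page_html)
    parser.close()
    linked = set(parser.link_texts) & set(characters)
    linked.discard(page_owner)  # a page linking to itself is not a connection
    return sorted(linked)


# Made-up sample data, not taken from the wiki.
characters = ["Harry Potter", "Ron Weasley", "Hermione Granger"]
sample_html = (
    '<p><a href="/wiki/Harry_Potter">Harry Potter</a> met '
    '<a href="/wiki/Ron_Weasley">Ron Weasley</a> on the '
    '<a href="/wiki/Hogwarts_Express">Hogwarts Express</a>.</p>'
)

edges = [("Harry Potter", other)
         for other in connections(sample_html, "Harry Potter", characters)]
print(edges)  # [('Harry Potter', 'Ron Weasley')]
```

Links to pages outside the character list, such as the Hogwarts Express here, are dropped, so the network only contains edges between the scraped characters.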
Both Hadley Wickham’s new R web scraping tool rvest and the awesome SelectorGadget actually made this web scraping task much easier than I thought it would be. Note that rvest is not yet available on CRAN, but you can get the development version off his GitHub using devtools. One of my next posts will outline how to use rvest in more detail.