How I created an SNL dataset with Scrapy


Not long ago, Kaggle introduced its new datasets feature. Every member of the community can now upload their own datasets for others to play with. This is a very cool thing, and there are lots of interesting datasets out there. You can also use Kaggle to promote your own dataset. I was thinking about a dataset I could provide, and while reading through the LiveFromNewYork subreddit I got the idea: what about a Saturday Night Live dataset? I searched the web and found a website with a very comprehensive database. I contacted its creator but got no answer. I didn't want to stop my project before it really began, so I decided to scrape the data from the website. This blog post shows you how I did that and what we can learn from over 40 seasons of hilarious data.

Where did I get the data?

To compile the dataset I used the following two sources:

How did I get the data?

To create this dataset I used Python and Scrapy. Scrapy is a framework for scraping data from the web. If you want to learn how that works, look at the notebook that explains the process. It can be found in the GitHub repository for the project.

To display graphs in my analysis I used Bokeh. Sadly, GitHub does not render Bokeh output, so my graphs do not appear if you open the notebooks on GitHub. If you want the full experience, please clone the repository and open the notebook in an environment that supports Bokeh. The easiest way is to create an Anaconda environment with the following Python modules installed:

  • pandas
  • numpy
  • bokeh
  • scrapy
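
Once the environment is set up, a quick check like the following can confirm that the four modules are importable. This is only a small convenience sketch; the function name `missing_modules` is my own and not part of the project.

```python
import importlib.util

# The modules listed above.
REQUIRED = ["pandas", "numpy", "bokeh", "scrapy"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```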

Example analysis

If you want to see an example analysis of the data, please refer to my analysis notebook. It answers questions like:

  • How have the ratings developed over the years?
  • Which actors had the biggest presence on the show (most titles per episode on average)?
  • Which actor had the most appearances in a single episode?
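
The last two questions can be sketched with pandas roughly as follows. The rows and the column names (`actor`, `episode`, `title`) are toy assumptions for illustration and do not reflect the real layout of the dataset. Note that the average here only covers episodes in which an actor appeared at all.

```python
import pandas as pd

# Toy rows standing in for the real data; all column names are assumptions.
appearances = pd.DataFrame({
    "actor": ["A", "A", "A", "B", "B"],
    "episode": [1, 1, 2, 1, 2],
    "title": ["t1", "t2", "t3", "t1", "t3"],
})


def titles_per_episode(df):
    """Count the distinct titles each actor appears in, per episode."""
    return df.groupby(["actor", "episode"])["title"].nunique()


per_episode = titles_per_episode(appearances)

# Biggest presence: mean number of titles per episode for each actor.
average_presence = (
    per_episode.groupby(level="actor").mean().sort_values(ascending=False)
)

# Most appearances in a single episode: the (actor, episode) pair with the top count.
top_actor, top_episode = per_episode.idxmax()
```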

If you have some ideas for other questions to answer, just send them to me or play with the data yourself.

Dataset on Kaggle

The dataset is available on Kaggle, too. Just use one of the following links: