Web data scraper


Most social media analysis starts with getting hold of some text data, whether it be tweets, tags, some blog text, or a list of links to friends. For a project, why not start looking at scraping and then analysing text from the web?

For this project, you could:

Day one: build a web page scraper to scrape a webpage of text.

Day two: either get additional data from Twitter, or analyse the data from the webpage.


Day One

Check out the extra materials on network programming, specifically the web/HTML tutorial if you aren't familiar with HTML, and the webpage scraping tutorial. Have a go at building your own website and scraping data from either it or someone else's site. You should have your own personal website on the Leeds system; if not, you can scrape local pages on your hard drive instead.
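
As a rough illustration of the day one task, the sketch below reads a page from a URL or a local file and strips the markup to leave the text. It assumes the project is being done in Java (version 11 or later); the class name PageScraper and its methods are made up for this example. The regular-expression tag stripping is deliberately crude, does not decode entities such as &amp;, and is not necessarily the approach taken in the scraping tutorial.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class PageScraper {

    // Fetch the raw HTML of a page over HTTP(S).
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Crude text extraction: drop script/style blocks, then every remaining tag,
    // then collapse runs of whitespace into single spaces.
    static String extractText(String html) {
        String noScripts = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1\\s*>", " ");
        String noTags = noScripts.replaceAll("(?s)<[^>]*>", " ");
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage: java PageScraper <url-or-local-file>");
            return;
        }
        // Treat the argument as a URL if it looks like one, otherwise as a local file.
        String source = args[0];
        String html = source.startsWith("http://") || source.startsWith("https://")
                ? fetch(source)
                : Files.readString(Path.of(source));
        System.out.println(extractText(html));
    }
}
```

Running it against a page you have written yourself first makes it easy to check that the text that comes out is the text you put in. If you scrape someone else's site, check that they allow it.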

Day Two

Have a go at the Twitter scraper materials linked from the scraping practical, or explore the linguistic analysis options in the OpenNLP package (see the third set of slides in the "scientific libraries" extra materials).
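
If you want a simple starting point for analysing the scraped text before trying a full linguistic toolkit, a word-frequency count is one option. The sketch below again assumes Java and uses only the standard library rather than the linguistic analysis package mentioned above; the class name WordCounter and its methods are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCounter {

    // Count how often each word occurs, ignoring case and punctuation.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        // Split on anything that is not a letter, a digit, or an apostrophe.
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{Nd}']+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Return up to n words, most frequent first; ties are broken alphabetically.
    static List<Map.Entry<String, Integer>> topWords(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        String sample = "The cat sat on the mat. The mat was flat.";
        for (Map.Entry<String, Integer> entry : topWords(countWords(sample), 3)) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

In practice you would pass in the text extracted from your scraped page rather than the sample sentence, and you may want to filter out very common words such as "the" before ranking.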