
Intro to web scraping


To scrape our webpage, we'll use the HTML parser BeautifulSoup.


First, make a new directory for your Python code. If you installed Python with Anaconda, BeautifulSoup comes with it. If not, you'll need to download it from the BeautifulSoup download page.
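
(If you prefer pip, the library can also be installed from the command line with "pip install beautifulsoup4"; the requests library used below can be installed the same way with "pip install requests".)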


Now download this file into the same directory: scraper.py. Open it up and have a look at it. It demonstrates a few things.

First, how to get a page:

import requests

page = requests.get("http://www.geog.leeds.ac.uk/courses/computing/practicals/python/web/scraping-intro/table.html")
content = page.text   # the HTML of the page as a string
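
If the URL is wrong or the server is down, requests won't raise an error by itself; an optional check (not part of scraper.py) that stops the script with a clear exception on a bad response is:

page.raise_for_status()   # raises an HTTPError for 4xx/5xx responses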


Secondly, how to get an element with a specific id:

table = soup.find(id="datatable")
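
The soup variable here is the parsed document. If you are writing this yourself rather than reading scraper.py, it is built from the page text along these lines (a minimal sketch, assuming the built-in html.parser backend):

from bs4 import BeautifulSoup

soup = BeautifulSoup(content, "html.parser")   # parse the downloaded HTML
table = soup.find(id="datatable")              # the element with id="datatable"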


Third, how to get all the elements with a specific tag and loop through them:

trs = table.find_all('tr')

for tr in trs:
    # Do something with the "tr" variable.
    pass   # placeholder so this fragment is valid Python on its own
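
find_all returns a list of matching elements (technically a ResultSet, which behaves like a normal Python list), so you can check how many rows were found before looping over them (a quick sanity check, not in the original snippet):

print(len(trs), "rows found")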


And finally, how to get the text inside an element:

tds = tr.find_all("td")

for td in tds:
    print(td.text)
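
The .text attribute keeps any whitespace inside the tag; if that gets in the way, BeautifulSoup's get_text method can strip it (an optional tidy-up, not something the exercise depends on):

print(td.get_text(strip=True))   # text content with surrounding whitespace removed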


Note that even though the tags are in upper case in the HTML file, they are lowercased when parsed, so lower case is what we search for.
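
Put together, the core of the process looks something like this (a minimal sketch rather than a copy of scraper.py, assuming the table cells hold plain text):

import requests
from bs4 import BeautifulSoup

url = "http://www.geog.leeds.ac.uk/courses/computing/practicals/python/web/scraping-intro/table.html"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")

table = soup.find(id="datatable")        # the table we're after
for tr in table.find_all("tr"):          # each row in the table
    for td in tr.find_all("td"):         # each cell in the row
        print(td.text)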


The Beautiful Soup library (homepage) is beautifully written, and comes with some fairly clear documentation.

For scraping Twitter you need tweepy and, for most things, a Twitter developer key. See also the Developers' site. Similar libraries exist for other social media sites.
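
As a taste of what the Twitter side looks like, here is a minimal sketch assuming tweepy 4.x and a bearer token from the developer site (the token value and the search query are placeholders, and the exact calls available depend on your tweepy version and API access level):

import tweepy

# Placeholder credential from the Twitter developer site.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Search recent tweets containing a term and print their text.
response = client.search_recent_tweets(query="Leeds", max_results=10)
for tweet in response.data or []:
    print(tweet.text)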