Web-scraping with Java


To scrape our webpage, we'll use the HTML parser "jsoup".


First, make a new directory for your Java code. Then, go to the jsoup download page and download the "jar" file called "core library".

This library includes the packages:
org.jsoup
org.jsoup.helper
org.jsoup.nodes
org.jsoup.select
org.jsoup.parser
org.jsoup.safety
org.jsoup.examples

You can get at these by unzipping the file if you like (jars are just zip files with a different extension and one extra metadata file inside). However, don't do this for the moment -- we'll use it as a packed jar so we can get used to working that way instead.
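You can see the jar-is-a-zip point for yourself. The quick check is to list the jsoup jar with any zip tool (e.g. "unzip -l jsoup-1.7.3.jar"). The sketch below demonstrates the same idea from the other direction, using Python's built-in zipfile module purely because it is widely installed -- it builds an archive named like a jar with an ordinary zip tool, then lists it (the filenames here are made up for the demo):

```shell
# A jar really is a zip: an ordinary zip tool can both create
# and list a file with a .jar extension.
printf 'hello\n' > A.txt

# Create "demo.jar" -- this is just a zip archive with a jar name.
python3 -m zipfile -c demo.jar A.txt

# List its contents, exactly as "unzip -l" or "jar tf" would.
python3 -m zipfile -l demo.jar
```

The only thing a real jar adds on top of this is a META-INF/MANIFEST.MF entry describing the archive.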


Now download this class into the same directory: Scraper.java. Open it up and have a look at it. It demonstrates a few things.

First, how to get a page:

Document doc = null;
try {
   doc = Jsoup.connect("http://www.geog.leeds.ac.uk/.../table.html").get(); // URL shortened!
} catch (IOException ioe) {
   ioe.printStackTrace();
}
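Two Connection settings are worth knowing about when you fetch pages for real: a user agent (so the site knows who is asking) and a timeout. The sketch below is a hypothetical example, not part of Scraper.java -- the URL is a placeholder, and note that Jsoup.connect() only builds the request; nothing is fetched until get() or execute() is called:

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class ConnectDemo {
    public static void main(String[] args) {
        // connect() just builds a request description -- no network
        // traffic happens until get() or execute() is called.
        Connection conn = Jsoup.connect("http://www.example.com/")
                .userAgent("Mozilla/5.0 (course scraper)") // identify yourself
                .timeout(10 * 1000);                       // ten seconds, in ms

        System.out.println("Request built for: " + conn.request().url());
        // conn.get() would now fetch and parse the page into a Document.
    }
}
```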

If you'd rather download the page to your hard drive first, so you can experiment without repeatedly hitting the page online (which is also the polite thing to do), you'd do this:

File input = new File("c:/pages/table.html");
Document doc = null;
try {
   doc = Jsoup.parse(input, "UTF-8", "");
} catch (IOException ioe) {
   ioe.printStackTrace();
}


Secondly, how to get an element with a specific id:

Element table = doc.getElementById("datatable");


Third, how to get all the elements with a specific tag and loop through them:

Elements rows = table.getElementsByTag("TR");

for (Element row : rows) {
   // Do something with the "row" variable.
}


And finally, how to get the text inside an element:

Elements tds = row.getElementsByTag("TD");

for (int i = 0; i < tds.size(); i++) {
   System.out.println(tds.get(i).text()); // Though our file uses every second element.
}
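Putting the four pieces together: the sketch below is not the downloaded Scraper.java (which will differ), but a hypothetical minimal class using the same calls, with a made-up class name and made-up table rows. So that it runs offline it parses a hard-coded HTML string; to scrape the live page, swap the Jsoup.parse() line for the Jsoup.connect(...).get() version shown earlier:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class TableScraper {

    // Collect the text of every TD inside the table with the given id,
    // row by row -- the same element/tag calls as in Scraper.java.
    static List<String> extractCells(Document doc, String tableId) {
        List<String> cells = new ArrayList<>();
        Element table = doc.getElementById(tableId);
        Elements rows = table.getElementsByTag("TR");
        for (Element row : rows) {
            Elements tds = row.getElementsByTag("TD");
            for (int i = 0; i < tds.size(); i++) {
                cells.add(tds.get(i).text());
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        // Made-up stand-in for table.html so the sketch runs offline.
        String html = "<table id=\"datatable\">"
                    + "<tr><td>a</td><td>b</td></tr>"
                    + "<tr><td>c</td><td>d</td></tr>"
                    + "</table>";
        Document doc = Jsoup.parse(html);
        for (String cell : extractCells(doc, "datatable")) {
            System.out.println(cell);
        }
    }
}
```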


So, let's run the class. Making sure that the class and the jar file are in the same directory, we can ask the compiler to look inside the jar file for classes it needs, thus:

javac -cp .;jsoup-1.7.3.jar *.java

And likewise the JVM (the ";" in the classpath is the Windows separator; on Linux or a Mac, use ":" instead):

java -cp .;jsoup-1.7.3.jar Scraper

Give it a go -- it should scrape our table.html from the first part of the practical.


The jsoup library (homepage) is beautifully written, and comes with a very clear cookbook of how to do stuff, along with detailed API docs. The cookbook sometimes lacks a list of packages to import (just import everything if in doubt), but otherwise is a great starting point.

If your data is in XML, your best starting point is this XML lecture and practical.

If your data is in JSON, you can get the JSON data as a String using:

String json = Jsoup.connect(url).ignoreContentType(true).execute().body();

and then parse it (split it into components) using a JSON library like the standard one or gson.
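As a sketch of that parsing step using gson (version 2.8.6 or later for JsonParser.parseString), with a made-up JSON string standing in for what the Jsoup.connect(...) line above would return:

```java
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

public class JsonDemo {
    public static void main(String[] args) {
        // In real use this String would come from the
        // Jsoup.connect(url)...execute().body() call above.
        String json = "{\"name\": \"Leeds\", \"cells\": 4}";

        // Parse the raw text into a tree of JSON components.
        JsonObject obj = JsonParser.parseString(json).getAsJsonObject();

        System.out.println(obj.get("name").getAsString()); // prints "Leeds"
        System.out.println(obj.get("cells").getAsInt());   // prints 4
    }
}
```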

For scraping Twitter, you need twitter4j, and for most things a Twitter developer's key. See also the Developers' site. Similar libraries exist for other social media sites.