
Data processing


In this practical, we'll work through two tutorials on data processing. The first covers pandas; the second covers Bokeh, a library that displays its output in web pages and so links Python with JavaScript.

In this part, we'll also refamiliarise ourselves with Jupyter/IPython notebooks. Jupyter makes a great data analysis environment because you can annotate the code as you go with images, text, and links, which leaves an auditable record of how the data was processed. It also means that reports can be kept up to date by connecting them to data stores and re-running them when a refresh is needed. Finally, notebooks can be combined with dynamic elements to produce interactive data "dashboards".

First up, then, pandas. For this, we'll work through the Software Carpentry course materials on pandas. Software Carpentry is an international organisation of volunteers teaching good programming style to researchers. They have a wide variety of courses in everything from shell scripting to Git, but one of their latest courses is on learning to program in Python using pandas.

As we already know quite a lot of Python, we don't need to do the whole of the course. First, follow "Setup" to download the dataset (unzip it somewhere sensible – it should create a data directory) and to get Jupyter up and running. Run Jupyter from a command prompt opened in the directory that contains the data directory, not in the data directory itself. Then work through parts 7, 8 and 9 of the course materials (that is, from "Reading Tabular Data into DataFrames" to "Lunch"). Give it a go. A few things to help:

  • You'll need to make yourself a new Python notebook in the directory you open Jupyter from.
  • Remember that you press SHIFT + ENTER to run a cell (a box of code) in Jupyter.
  • Note that complicated notebooks can take a few seconds (up to 30) to run. If they don't quickly complain, they're probably running.
  • The microbe data question is theoretical – you don't actually have this data downloaded.
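  • To understand wealth_score = mask_higher.aggregate('sum',axis=1)/len(data.columns) remember that in Python True == 1 and False == 0, so summing booleans counts the True values. (Other non-zero numbers are treated as true in a condition, but they are not equal to True: 2 == True is False.) In pandas, axis=1 makes the sum run along each row, across the columns, giving one value per row. Print mask_higher if that helps.
  • "Magic" commands are commands given directly to IPython/Jupyter rather than to Python. They start with a percent sign (%).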
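
To make the wealth_score line concrete, here is a minimal sketch using a tiny made-up table rather than the course dataset. The country names, column names, and values are hypothetical, and the mask_higher line follows the usual pattern of comparing each value with its column mean, which is an assumption about how the course builds it.

```python
import pandas as pd

# Hypothetical stand-in for the course data: rows are countries, columns are years.
data = pd.DataFrame(
    {"gdp_1952": [1.0, 5.0, 9.0], "gdp_2007": [2.0, 10.0, 12.0]},
    index=["A", "B", "C"],
)

# Assumed construction: True where a value is above the mean of its column.
mask_higher = data > data.mean()
print(mask_higher)

# Booleans behave as numbers in arithmetic: True counts as 1, False as 0.
print(True + True)   # 2
print(2 == True)     # False: 2 is "truthy" but not equal to True

# axis=1 sums along each row (across the columns), giving one count per country.
# Dividing by the number of columns turns the count into a fraction of years.
wealth_score = mask_higher.aggregate("sum", axis=1) / len(data.columns)
print(wealth_score)
```

With these made-up numbers, country A is never above the column mean, B is above it in one of the two years, and C in both, so the scores come out as 0.0, 0.5 and 1.0.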

  • Now you've had a bit of a go at working with Jupyter, it is probably worth going back to our first practical from the core course. Read it through again and refamiliarise yourself with what Jupyter can do. Then check out the example notebook listed in that practical and see if you can work out how it works (note, it may take a few seconds to run). It uses numpy, but perhaps more interesting is that it includes interactive elements. See if you can understand how these work; the documentation is linked at the top of the notebook.

    Once you've done that, go on to the next section to look at Bokeh. Jupyter lets us save notebooks as static HTML copies, but this isn't quite the same as integrating analyses dynamically into a website. Bokeh is much more focused on that job.

