Web Processing

Dr Andy Evans

Getting web pages

  • First we need to get the webpage by issuing an HTTP request. The best option for this is the requests library, which comes with Anaconda:
    http://docs.python-requests.org/en/master/

    import requests
    r = requests.get('https://etc', auth=('user', 'pass'))
    The username and password are optional.
  • To get the page:
    content = r.text

Other variables and functions

  • Servers return an HTTP status code along with any HTML and files:
    https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
    The code is captured in r.status_code:
    200 OK
    204 No Content
    400 Bad Request
    401 Unauthorized
    403 Forbidden
    404 Not Found
    408 Request Timeout
    500 Internal Server Error
    502 Bad Gateway (for servers passing on requests elsewhere)
    504 Gateway Timeout (for servers passing on requests elsewhere)
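
As a small illustration of working with these codes, the sketch below turns a numeric code (such as the value in r.status_code) into a readable description. It uses only the standard library's http.HTTPStatus; the helper name describe_status and the class labels are assumptions made for this example, not part of requests.

```python
from http import HTTPStatus

# Hypothetical helper: the name and the class labels are illustrative only.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def describe_status(code):
    """Return a description such as '404 Not Found (client error)'."""
    try:
        phrase = HTTPStatus(code).phrase  # standard reason phrase
    except ValueError:
        phrase = "Unknown"  # code is not a registered status
    status_class = STATUS_CLASSES.get(code // 100, "unknown class")
    return f"{code} {phrase} ({status_class})"

for code in (200, 404, 502):
    print(describe_status(code))
```

In practice the argument would be r.status_code from a real response; codes in the 400s point to a problem with the request, and codes in the 500s to a problem on the server side.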

JSON

Other options

  • Ability to deal with cookies.
  • Ability to pass parameters to servers in a variety of ways.
  • Ability to maintain sessions with a server.
  • Ability to issue custom headers representing different browsers ("user-agent"), etc.
  • Ability to deal with streaming.
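
A minimal sketch of a few of these options, assuming the requests library is installed. The URL, parameter names, and user-agent string are hypothetical placeholders. The request is only prepared, not sent, so the example runs without a network connection.

```python
import requests

# A session keeps cookies and default headers across requests.
session = requests.Session()

# Custom header: a hypothetical user-agent string for illustration.
session.headers.update({"User-Agent": "my-scraper/0.1"})

# Parameters passed as a dict are encoded into the query string.
request = requests.Request(
    "GET",
    "https://example.com/search",  # placeholder URL
    params={"q": "python", "page": 2},
)
prepared = session.prepare_request(request)

print(prepared.url)                    # URL with the encoded parameters
print(prepared.headers["User-Agent"])  # header taken from the session

# To actually issue the request:
# response = session.send(prepared, timeout=10)
```

For everyday use, session.get(url, params=...) does the preparing and sending in one step; the two-step form is used here only so the encoded URL and headers can be inspected.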

Processing webpages

How to get elements

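  • Getting elements by ID or other attributes (content is the page text from r.text):
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(content, "html.parser")

    table = soup.find(id="yxz")
    tds = soup.find_all(attrs={"class" : "y"})

  • Getting all elements of a specific tag:
    trs = table.find_all('tr')

    for tr in trs:
        pass  # Do something with the "tr" variable.

  • Getting the elements inside another element and reading their text:
    tds = tr.find_all("td")

    for td in tds:
        print(td.text)

  • Tag names are converted to lowercase when the page is parsed, so search for "td" rather than "TD".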
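
The calls above can be tried end to end on a small inline page. This is a sketch assuming BeautifulSoup (the bs4 package) is installed; the HTML string, its id and class values, and the variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# A made-up page fragment standing in for r.text.
html = """
<table id="yxz">
  <tr><td class="y">1</td><td class="y">2</td></tr>
  <tr><td class="y">3</td><td class="x">4</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Find one element by id, then every row inside it.
table = soup.find(id="yxz")
rows = []
for tr in table.find_all("tr"):
    # Collect the text of each cell in this row.
    rows.append([td.text for td in tr.find_all("td")])

# Find elements anywhere in the page by attribute.
y_cells = soup.find_all(attrs={"class": "y"})

print(rows)
print(len(y_cells))
```

Here rows ends up as a list of lists, one inner list per table row, which is a convenient shape for writing out as CSV or loading into other tools.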

Client side coding

  • Generally done in JavaScript.
  • Very similar to Python in concept, though the syntax differs.
  • Each statement ends in a semicolon;
  • Blocks are defined by {}

    function dragStart(ev) {}
    if (a < b) {
    } else {
    }
    for (a = 0; a < b; a++) {}

    var a = 12;
    var a = [1,2,3];
    // Comment
    /**
    * Comment
    **/
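
For comparison, here is a rough Python counterpart of the JavaScript fragments above. It is only a side-by-side sketch: the function body and the values of a and b are invented so that the code runs.

```python
# Python counterparts of the JavaScript fragments (illustrative values only).

def drag_start(ev):
    """Blocks are marked by indentation rather than {}."""
    return ev

a, b = 0, 3

if a < b:
    smaller = a
else:
    smaller = b

visited = []
for a in range(b):  # JavaScript: for (a = 0; a < b; a++) {}
    visited.append(a)

a = 12         # JavaScript: var a = 12;
a = [1, 2, 3]  # JavaScript: var a = [1,2,3];

# Comment
"""
Multi-line strings are often used where JavaScript uses /** ... **/
"""
```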

Getting elements in JavaScript

  • document is the root of the page.
    var a = document.getElementById("yxz");

    var a = document.getElementsByClassName("datatable");

    var tds = document.getElementsByTagName("TD");

  • Getting text:
    alert(tds[0].innerHTML); // popup box
    console.log(tds[0].innerHTML); // Browser console (F12 opens it in most browsers)

  • Setting text:
    tds[0].innerHTML = "2";

Connecting JavaScript

  • JavaScript is largely run through event-based programming.
  • Each HTML element has specific events associated with it. We attach a function to an event, to be run when that event fires, thus:
    <SPAN id="clickme" onclick="functionToRun()">Push</SPAN>
    <BODY onload="functionToRun()">

Where to put JavaScript

  • Functions are placed between <SCRIPT> tags in either the head or body. In the body, code that is not inside a function runs as the page loads, in order.
  • Alternatively, the code can be in an external script, linked to with a filename or URL in the body or head, thus:
    <script src="script.js"></script>
  • Example:
    <HTML>
    <HEAD>
    <SCRIPT>
    function clicked() {
        var a = document.getElementById("clickme");
        a.innerHTML = "changed";
    }
    </SCRIPT>
    </HEAD>
    <BODY>

    <SPAN id="clickme" onclick="clicked()">Push</SPAN>

    </BODY>
    </HTML>