Web basics

Dr Andy Evans

The internet and web

  • The internet: fundamental network over which different communication technologies can work (including email, file transfer, the web). Generally regarded as those elements using the TCP/IP protocol (more later).
  • The web: a hypertext (linked text) system with embedded media, built on top of the internet.

Python and the internet

  • Mainly used for data retrieval and processing, but can be used for backend web work and internet communication.
  • To understand the web-based elements we need to first understand the basics of the web and webpages.

The web

  • The web has a client-server architecture.
    [Figure: client-server relationship]

Setting up servers

  • The key element of a client-server system is the socket ("Berkeley socket"; "POSIX socket").
  • Sockets are connections between machines which you can connect streams to. You can then write and read data to/from them.
  • Basic operation is for a client to contact a machine via an address and a port number. The server program sits on a machine and listens to the port, waiting for contact, using a server socket. When contact occurs, it generates a standard socket connected at the other end to the client socket.

Client

  • Open a socket to the server on the client:
    import socket

    socket_1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    socket_1.connect(("localhost", 5555)) # Address tuple

    socket_1.send(bytes("hello world", encoding="UTF-8"))

  • Here the address of the machine we're trying to connect to is "localhost"; this indicates the local machine we're on. "5555" is the "port" number. We'll come back to this shortly. We're sending the data as bytes: the string encoded as UTF-8.

Server

  • Wait for a connection and then generate a socket for communication:
    import socket

    serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    serversocket.bind(('localhost', 5555))
    serversocket.listen()
    (socket_2, address) = serversocket.accept()
    b = socket_2.recv(30)
    print(str(b))
  • As we've bound this program to localhost, we'd need to run it on the same machine as the client. For internet-enabled code we'd change this to the network address of the machine the server runs on (we'll come back to this shortly). The program is set to receive up to 30 bytes of data. It will print "b'hello world'". The "b" indicates a bytes object rather than a text string, as that's how the client sent it.

The web / internet

  • A full internet or web application also has to:
    • Open multiple sockets to different clients. One server can serve several clients at once.
    • Send data of arbitrary lengths.
    • Deal with potential security issues.
  • The core technology is the same though.
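  • As a rough sketch of the first two requirements, the standard library's socketserver module can give each client its own thread, and a loop around recv() can read a message of any length. The handler name UppercaseHandler, the helper send_message, and the upper-casing behaviour are assumptions made up for this illustration; security is not addressed here.

```python
import socket
import socketserver
import threading


class UppercaseHandler(socketserver.BaseRequestHandler):
    """Hypothetical handler: read everything the client sends, reply in upper case."""

    def handle(self):
        chunks = []
        while True:
            chunk = self.request.recv(1024)  # read up to 1024 bytes at a time
            if not chunk:                    # empty bytes: client has finished sending
                break
            chunks.append(chunk)
        self.request.sendall(b"".join(chunks).upper())


def send_message(address, message):
    """Send a message of any length and return the server's full reply."""
    with socket.create_connection(address) as client:
        client.sendall(message)
        client.shutdown(socket.SHUT_WR)      # tell the server no more data is coming
        reply = []
        while True:
            chunk = client.recv(1024)
            if not chunk:
                break
            reply.append(chunk)
    return b"".join(reply)


# Port 0 asks the operating system for any free port.
# ThreadingTCPServer starts a new thread for each client that connects.
server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), UppercaseHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

long_message = b"hello world " * 1000        # far more than one recv() call returns
reply = send_message(server.server_address, long_message)
print(len(reply), reply[:11])

server.shutdown()
server.server_close()
```

  • The client closes its sending side to mark the end of the message; real protocols more often send the length first or use a terminator, which is one reason agreed formats matter.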

Client - Server architecture

  • Here we've sent relatively plain binary data representing text. However, it would be usual to have a more complicated format known as a protocol, which the server recognises and processes.
  • E-mails are sent out to servers using Simple Mail Transfer Protocol (SMTP)
  • Webpages are sent out from servers using the HyperText Transfer Protocol (HTTP).
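  • To make the idea of a protocol concrete, the sketch below starts a small HTTP server from the standard library on the local machine and talks to it over a plain socket. The handler name HelloHandler and its fixed "hello world" reply are assumptions for the demonstration; the point is only that an HTTP request is text laid out in an agreed way.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short text body."""

    def do_GET(self):
        body = b"hello world"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demonstration quiet


server = HTTPServer(("127.0.0.1", 0), HelloHandler)   # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# The request: a request line, a header line, then a blank line to finish.
request = (
    "GET / HTTP/1.0\r\n"
    "Host: localhost\r\n"
    "\r\n"
)
with socket.create_connection(server.server_address) as client:
    client.sendall(request.encode("ascii"))
    response = b""
    while True:
        chunk = client.recv(1024)
        if not chunk:          # server closed the connection: response complete
            break
        response += chunk

server.shutdown()
server.server_close()

status_line = response.split(b"\r\n", 1)[0]
print(status_line)
print(response.decode("ascii"))
```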

Introduction to network communications

  • Several protocols may be involved at once. Most computers use "TCP/IP" when communicating with network nodes and other computers on the internet.
  • Internet Protocol (IP):
    • Moves data around in small chunks called "packets".
    • Addresses them to the right machine and routes them across the network.
  • Transmission Control Protocol (TCP):
    • Guarantees packets get to their destination, resending any that are lost.
    • Lets computers confirm receipt.
    • Adds packets back together in the right order.
  • Protocols like HTTP then define the format of the data that these lower-level protocols split into packets and carry.
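  • The splitting and reordering can be pictured with a toy model. This is not how TCP/IP is implemented; the function names and the packet size of 8 bytes are assumptions chosen purely to show numbered chunks arriving out of order and being put back together.

```python
import random


def split_into_packets(data, size):
    """Split bytes into (sequence_number, chunk) pairs of at most `size` bytes."""
    return [(i // size, data[i:i + size]) for i in range(0, len(data), size)]


def reassemble(packets):
    """Sort chunks by sequence number and join them back together."""
    return b"".join(chunk for _, chunk in sorted(packets))


message = b"hello world, this is a longer message"
packets = split_into_packets(message, 8)
random.shuffle(packets)          # packets may arrive in any order
restored = reassemble(packets)
print(restored == message)
```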

Ports

  • Ports are numerical handles which software can associate with; one piece of software per port.
  • Which server program gets which messages will depend on the port they are sent to, which is usually related to the transmission protocol. The computer looks at the port number associated with the message and diverts it to the registered software.
  • E-mails are sent out to servers using port 25 (SMTP).
  • Webservers use port 80 (HTTP).
  • We used port 5555 as the first 1024 ports (0 to 1023) are allocated to specific purposes and protocols.

IP addresses

  • An important element of the system is the Internet Protocol (IP) addresses of the machines to receive the messages.
  • IP addresses are numeric: 129.11.87.11 is the School webserver.
  • However, many networked machines hold a registry which maps these numbers to the names the machines would prefer to be called. This scheme is called the Domain Name System (DNS).
  • www.geog.leeds.ac.uk is the domain name of 129.11.87.11
  • In order for code to use a domain name, it must perform a DNS lookup, contacting the nearest DNS server.
  • localhost is a special name that maps to 127.0.0.1, which always means the local machine you're on.
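  • In Python a lookup of this kind can be sketched with socket.gethostbyname(). The wrapper name lookup is an assumption for illustration, and only names that resolve without a network connection are used here.

```python
import ipaddress
import socket


def lookup(name):
    """Return the IPv4 address for a host name (a numeric address comes back unchanged)."""
    return socket.gethostbyname(name)


print(lookup("localhost"))                              # typically 127.0.0.1
print(lookup("127.0.0.1"))                              # already numeric
print(ipaddress.ip_address("127.0.0.1").is_loopback)    # the local machine

# With a network connection, lookup("www.geog.leeds.ac.uk") would ask DNS for
# that name's current address, which may differ from the one quoted in the text.
```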

Ports and Firewalls

  • If you set up client/server software using sockets, always check what other programs, if any, use the port you are on.
  • Some networks have "Firewalls". These are security devices or programs that watch ports and block some connections. Check for them. In general, scanning to see which ports are open ("port scanning") is regarded as suspicious behaviour, so keep in close contact with your local IT team.
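  • One way to check a port on your own machine, without scanning anyone else's, is simply to try binding to it. The helper name port_is_free is an assumption for this sketch, and the answer can go stale straight away if another program grabs the port afterwards.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently bound to host:port on this machine."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


# Occupy a port chosen by the operating system (port 0 means "any free port").
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.bind(("127.0.0.1", 0))
holder.listen()
taken_port = holder.getsockname()[1]

free_while_held = port_is_free(taken_port)    # another socket is listening on it
holder.close()
free_after_close = port_is_free(taken_port)   # normally free again once closed

print(taken_port, free_while_held, free_after_close)
```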

Understanding URLs

  • The web is a client-server system based around port 80 and HTTP. When a server gets a request, it is usually to send out a webpage from a directory on the server.
  • The file is usually referenced using a Uniform Resource Locator (URL).
  • http://www.w3.org:80/People/Berners-Lee/Overview.html
  • A URL represents another file on the network. It comprises:
    • A method for locating information, i.e. a transmission protocol, e.g. the HyperText Transfer Protocol (http).
    • A host machine name, e.g. www.w3.org
    • A path to a file on that server, e.g. /People/Berners-Lee/Overview.html
    • A port to connect to that server on, e.g. 80, the default for http.
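  • The same breakdown can be done in code. A minimal sketch using urlsplit() from the standard library's urllib.parse module on the example URL above:

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.w3.org:80/People/Berners-Lee/Overview.html")

print(parts.scheme)    # http
print(parts.hostname)  # www.w3.org
print(parts.port)      # 80
print(parts.path)      # /People/Berners-Lee/Overview.html
```

  • If the port is left out of the URL, parts.port is None and it is up to the client to fall back on the protocol's default.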

The Web

  • Web pages consist of text that will be displayed and tags that won't, which include formatting details or references to other files like images or javascript code.
  • The tags are referred to as the HyperText Markup Language (HTML).
  • Saved as text files with the suffix .html (or sometimes .htm). Note that if the filename is missing from the URL, the default file that servers will look to send is usually index.html
  • You can also look at webpages "locally", that is, directly from your hard drive, though some elements may not work properly unless served, especially those involved in downloading data.
  • A basic webpage:
    <HTML>

        <HEAD>
            <TITLE>Title for top of browser</TITLE>
        </HEAD>

        <BODY>
            <!--Stuff goes here; this is a comment-->
        </BODY>

    </HTML>
  • The HEAD contains information about the page, the BODY contains the actual information to make the page. Note tags are not case sensitive.

  • Basic tags:
    <BODY>
        The break tag breaks a line<BR />
        like that.

        <P>
        The paragraph tags
        </P>

        leave a line.
        This is <B>Bold</B>.
        This is <I>Italic</I>.

        <IMG src="tim.gif" alt="Photo: Pic of Tim" width="50" height="50" />

        <A href="index.html">Link text</A>
    </BODY>
  • The text in the file will only be shown with the format set out in the tags. Any line breaks etc. won't show up on screen.
  • Tags can have 'attributes', like the href attribute in the Anchor tag "A".
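  • When retrieving data with Python, attributes are often what we want to pull out. The sketch below uses the standard library's html.parser; the class name AttributeCollector and the choice to collect only href and src are assumptions for illustration.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Record (tag, attribute, value) for every href or src attribute seen."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # The parser passes tag and attribute names in lower case,
        # whatever case was used in the page.
        for name, value in attrs:
            if name in ("href", "src"):
                self.found.append((tag, name, value))


page = """
<BODY>
    This is <B>Bold</B>.
    <IMG src="tim.gif" alt="Photo: Pic of Tim" width="50" height="50">
    <A href="index.html">Link text</A>
</BODY>
"""

collector = AttributeCollector()
collector.feed(page)
print(collector.found)
```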

Tables

  • A lot of data is held in tables:
    <TABLE>
    <TR><TH>y</TH><TH>x</TH><TH>z</TH></TR>
    <TR><TD>2</TD><TD>5</TD><TD>3</TD></TR>
    <TR><TD>4</TD><TD>3</TD><TD>0</TD></TR>
    <TR><TD>3</TD><TD>1</TD><TD>5</TD></TR>
    </TABLE>
  • Which displays as:
    y   x   z
    2   5   3
    4   3   0
    3   1   5
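  • Getting such a table back into Python is a common data retrieval task. A minimal sketch with html.parser follows; the class name TableReader is an assumption, and it expects a simple, well-formed table like the one above (every cell inside a row, no nested tables).

```python
from html.parser import HTMLParser


class TableReader(HTMLParser):
    """Collect the text of each TH/TD cell, grouped by TR row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("th", "td"):
            self.in_cell = True
            self.rows[-1].append("")      # start a new, empty cell

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data     # add text to the current cell


table = """
<TABLE>
<TR><TH>y</TH><TH>x</TH><TH>z</TH></TR>
<TR><TD>2</TD><TD>5</TD><TD>3</TD></TR>
<TR><TD>4</TD><TD>3</TD><TD>0</TD></TR>
<TR><TD>3</TD><TD>1</TD><TD>5</TD></TR>
</TABLE>
"""

reader = TableReader()
reader.feed(table)

header, *data = reader.rows
records = [dict(zip(header, (int(cell) for cell in row))) for row in data]
print(header)
print(records)
```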

Document Object Model (DOM)

  • As tags are nested, HTML can be thought of as a tree structure called the Document Object Model (DOM).
  • Each element is a child of some parent.
  • The document has a root.
  • We can regard each element containing text as containing some innerHTML.
    [Figure: DOM tree]
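  • The tree can be seen by walking a page in code. This sketch uses xml.etree.ElementTree, which only works here because the small page below happens to be well-formed XML; real webpages often are not, and this is not a browser's DOM. The function name outline is an assumption.

```python
import xml.etree.ElementTree as ET

page = """<HTML>
    <HEAD>
        <TITLE>Title for top of browser</TITLE>
    </HEAD>
    <BODY>
        <P>The paragraph tags</P>
        <A href="index.html">Link text</A>
    </BODY>
</HTML>"""


def outline(element, depth=0):
    """Return one indented line per element, with children below their parent."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(page)        # the root element: HTML
tree_lines = outline(root)
print("\n".join(tree_lines))
```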

Classes and IDs

  • Elements may be given classes (generic groupings) and IDs (names specific to themselves) as attributes.
    <TABLE class="datatable" id="yxz">
    <TR><TD class="y">73</TD></TR>
    </TABLE>
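  • Classes and IDs also make it easier to pick out elements in code. A small sketch, again with xml.etree.ElementTree and so again assuming a well-formed snippet; the second cell (class "x", value 41) is hypothetical, added only so there is something to filter out.

```python
import xml.etree.ElementTree as ET

snippet = """<TABLE class="datatable" id="yxz">
<TR><TD class="y">73</TD><TD class="x">41</TD></TR>
</TABLE>"""   # the class "x" cell is a made-up addition for this example

table = ET.fromstring(snippet)

print(table.get("id"))        # the table's ID attribute
print(table.get("class"))     # the table's class attribute

# Find every TD, at any depth, whose class attribute is "y".
y_cells = table.findall(".//TD[@class='y']")
y_values = [int(cell.text) for cell in y_cells]
print(y_values)
```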

Cascading Style Sheets

  • In general we try to separate out the look of websites from their content.
  • The look is stored in something called a Cascading Style Sheet (CSS). These link elements with looks.
  • They are linked to the HTML in the HEAD with the following tag:
    <link rel="stylesheet" href="http://www.geog.leeds.ac.uk/courses/computing/css/doublePage.css">
    or if in same directory:
    <link rel="stylesheet" href="doublePage.css">
  • Example:
    /* All tables */
    TABLE {
        border: 1px solid black;
    }
    /* All tables of class datatable */
    TABLE.datatable {
        margin: 10px;
    }
    /* Tables of ID yxz */
    TABLE#yxz {
        background-color: white;
    }
    /* TD in Tables of ID yxz */
    TABLE#yxz td {
        padding: 10px;
    }

Good web design

  • Like any GUI, good web design concentrates on usability.
  • There are a number of websites that can help you - these are listed on the links page for this course.
  • See also the tutorial on the course pages.

Web accessibility

  • If you are working for a public organisation, accessibility for disabled people has to be a major design driver.
  • Generally you can make webpages more accessible by not putting important information only in images and sound.