Text file formats

Dr Andy Evans

CSV

  • Classic format: Comma Separated Values (CSV).
  • Easily parsed.
  • No information added by structure, so an ontology (in this case meaning a structured knowledge framework) must be externally imposed.
  • We've seen one way to read this.
10,10,50,50,10
10,50,50,10,10
25,25,75,75,25
25,75,75,25,25
50,50,100,100,50
50,100,100,50,50

csv.reader

  • Easy CSV reading:
    import csv
    f = open('data.csv', newline='')
    reader = csv.reader(f, quoting=csv.QUOTE_NONNUMERIC)
    for row in reader: # Each row is a list of values
        for value in row: # Each value in the row
            print(value) # Floats
    f.close() # Don't close until you are done with the reader;
            # the data is read on request.
  • The kwarg quoting=csv.QUOTE_NONNUMERIC converts all unquoted fields into floats. Remove it to keep the data as strings.
  • Note that there are different dialects of csv which can be accounted for:
    https://docs.python.org/3/library/csv.html
  • For example, add dialect='excel-tab' to the reader to open tab-delimited files.
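As a minimal sketch of the dialect option, the following reads tab-delimited text with dialect='excel-tab'. The sample data, and the use of io.StringIO in place of a real file, are assumptions made to keep the example self-contained.

```python
import csv
import io

# Hypothetical tab-delimited data standing in for a file on disk.
tab_text = "10\t10\t50\n25\t75\t25\n"

rows = []
with io.StringIO(tab_text) as f:
    reader = csv.reader(f, dialect='excel-tab', quoting=csv.QUOTE_NONNUMERIC)
    for row in reader:      # Each row is a list of floats.
        rows.append(row)

print(rows)
```

With a real file, the StringIO object would be replaced by open('data.tsv', newline=''), where data.tsv is a placeholder name.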

csv.writer

  • Easy CSV write:
    f2 = open('dataout.csv', 'w', newline='')
    writer = csv.writer(f2, delimiter=' ')
    for row in data:
        writer.writerow(row) # List of values.
    f2.close()
  • The optional delimiter here creates a space delimited file rather than csv.
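A self-contained sketch of the same idea, writing into an in-memory buffer so the result can be inspected. The data values are hypothetical, and io.StringIO stands in for the file opened above.

```python
import csv
import io

# Hypothetical data: a list of rows, each a list of values.
data = [[10, 10, 50], [25, 75, 25]]

buffer = io.StringIO()                      # Stands in for the output file.
writer = csv.writer(buffer, delimiter=' ')
writer.writerows(data)                      # Writes every row in one call.

text = buffer.getvalue()
lines = text.splitlines()
print(lines)
```

Any value that itself contains a space is wrapped in quotes by the writer, so the file can still be read back unambiguously.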

JSON

  • Designed to capture JavaScript objects.
  • Increasingly popular light-weight data format.
  • Text attribute and value pairs.
  • Values can include more complex objects made up of further attribute-value pairs.
  • Easily parsed.
  • Small(ish) files.
  • Limited structuring opportunities.
  • GeoJSON example
    {
        "type": "FeatureCollection",
        "features": [ {
            "type": "Feature",
            "geometry": {
                "type": "Point",
                "coordinates": [42.0, 21.0]
            },
            "properties": {
                "prop0": "value0"
            }
        }]
    }

Markup languages

  • Tags and content.
  • Tags often note the ontological context of the data, giving each value a meaning: that is, determining its semantic content.
  • All based on Standard Generalized Markup Language (SGML) [ISO 8879]

HTML (Hypertext Markup Language)

  • Nested tags giving information about the content.
    <HTML>
        <BODY>
            <P><B>This</B> is<BR>text
        </BODY>
    </HTML>
  • Note that tags can be on their own, some by default, some through sloppiness.
  • Not case sensitive.
  • Contains style information (though use discouraged).
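To illustrate tags that are never closed, here is a minimal sketch using the standard library's html.parser on the example above. The TagLogger class is a hypothetical helper written for this illustration, not part of the library.

```python
from html.parser import HTMLParser

html_text = "<HTML><BODY><P><B>This</B> is<BR>text</BODY></HTML>"

class TagLogger(HTMLParser):        # Hypothetical helper for this example.
    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)     # Tag names arrive lower-cased.

    def handle_endtag(self, tag):
        self.closed.append(tag)

parser = TagLogger()
parser.feed(html_text)
parser.close()

# Tags that were opened but never explicitly closed.
unclosed = [t for t in parser.opened if t not in parser.closed]
print(unclosed)
```

Here the P and BR tags are reported: BR never takes a closing tag, while the missing closing P is tolerated by HTML parsers.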

XML (eXtensible Markup Language)

  • More generic.
  • Extensible - not fixed terms, but terms you can add to.
  • Vast number of different versions for different kinds of information.
  • Used a lot now because of the advantages of human-readable data formats. These are larger than binary formats, but data transfer is now fast and memory cheap, so they have become feasible.

GML

  • Major geographical type is GML (Geography Markup Language).
  • Given a significant boost by the shift of Ordnance Survey from their own binary data format to this.
  • Controlled by the Open Geospatial Consortium (OGC):
    http://www.opengeospatial.org/standards/gml

    <gml:Point gml:id="p21"
    srsName="http://www.opengis.net/def/crs/EPSG/0/4326">
        <gml:coordinates>45.67, 88.56</gml:coordinates>
    </gml:Point>
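A rough sketch of pulling the coordinates out of a point like this with the standard library's xml.etree.ElementTree. The xmlns:gml declaration is an assumption added here so the fragment can be parsed on its own; in a full GML file it would normally be declared on an enclosing element.

```python
import xml.etree.ElementTree as ET

# The xmlns:gml declaration is an assumption added so the fragment parses alone.
gml_text = """
<gml:Point xmlns:gml="http://www.opengis.net/gml" gml:id="p21"
    srsName="http://www.opengis.net/def/crs/EPSG/0/4326">
    <gml:coordinates>45.67, 88.56</gml:coordinates>
</gml:Point>
"""

ns = {'gml': 'http://www.opengis.net/gml'}     # Prefix-to-namespace mapping.
point = ET.fromstring(gml_text.strip())
coords_text = point.find('gml:coordinates', ns).text
coords = [float(v) for v in coords_text.split(',')]
print(point.get('srsName'), coords)
```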

JSON Read

import json

f = open('data.json')
data = json.load(f)
f.close()
print(data)
print(data["features"])
print(data["features"][0]["geometry"])

for i in data["features"]:
    print(i["geometry"]["coordinates"][0])

  • Numbers are converted to ints or floats, etc.
  • GeoJSON example
    {
        "type": "FeatureCollection",
        "features": [ {
            "type": "Feature",
            "geometry": {
                "type": "Point",
                "coordinates": [42.0, 21.0]
            },
            "properties": {
                "prop0": "value0"
            }
        }]
    }
  • Result of print(data) (key order may differ between Python versions):
    {'features':
    [
    {'type': 'Feature', 'geometry':
        {'coordinates': [42.0, 21.0], 'type': 'Point'},
        'properties': {'prop0': 'value0'}
    }
    ],
    'type': 'FeatureCollection'}

Conversions

    JSON             Python
    object           dict
    array            list
    string           str
    number (int)     int
    number (real)    float
    true             True
    false            False
    null             None
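These conversions can be checked directly. The sketch below parses a small, made-up JSON string with json.loads and records the Python type of each value.

```python
import json

# Hypothetical JSON text containing one value of each kind.
text = '{"a": {"b": [1, 2.5, "s", true, false, null]}}'

obj = json.loads(text)
values = obj["a"]["b"]

# Record the Python type name of each converted value.
types = [type(v).__name__ for v in values]
print(type(obj).__name__, type(values).__name__, types)
```

Note that true and false become Python's bool, and null becomes None (type NoneType).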

JSON write

  • Easy JSON writer:
    import json

    f = open('data.json')
    data = json.load(f)
    f.close()

    f = open('out.json', 'w')
    json.dump(data, f)
    f.close()
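A variation on the above, as a sketch only: with-blocks close the files automatically, and the file is read back in to confirm the round trip. The data dictionary and the temporary directory are assumptions made so the example runs anywhere and leaves no files behind.

```python
import json
import os
import tempfile

# Hypothetical data to write out.
data = {"type": "Point", "coordinates": [42.0, 21.0]}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'out.json')
    with open(path, 'w') as f:      # Closed automatically at the end of the block.
        json.dump(data, f)
    with open(path) as f:
        restored = json.load(f)

print(restored == data)
```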

Serialisation

  • Serialisation is the conversion of code objects to a storage format; usually some kind of file.
  • Marshalling in Python is essentially synonymous, though in other languages it has slightly different uses (for example, in Java marshalling an object may involve additional storage of generic object templates).
  • Deserialisation (~unmarshalling): the conversion of storage-format objects back into working code.
  • The json code essentially does this for simple and container Python variables.
  • For more complicated objects, see pickle: https://docs.python.org/3/library/pickle.html
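As a minimal sketch of where pickle helps, the hypothetical record below contains a set and tuples, which json cannot store as they are. pickle.dumps serialises it to bytes and pickle.loads rebuilds an equal object.

```python
import json
import pickle

# Hypothetical record containing a set, which JSON has no type for.
record = {"visited": {(10, 50), (25, 75)}, "origin": (0, 0)}

try:
    json.dumps(record)
    json_ok = True
except TypeError:                   # json refuses to serialise a set.
    json_ok = False

stored = pickle.dumps(record)       # Serialise to bytes.
restored = pickle.loads(stored)     # Deserialise back into Python objects.

print(json_ok, type(stored).__name__, restored == record)
```

Only unpickle data from sources you trust: loading a pickle can run arbitrary code.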

Formatted printing

  • json.loads and json.dumps convert between JSON strings and Python objects: loads parses a string into objects, dumps turns objects into a string.
  • json.dumps has a nice print formatting option:
    print(json.dumps(data["features"][0]["geometry"], sort_keys=True, indent=4))

    {
        "coordinates": [
            42.0,
            21.0
        ],
        "type": "Point"
    }
  • More on the JSON library at: https://docs.python.org/3/library/json.html

JSON checking tool

  • Run at the command line:
    python -m json.tool < data.json
  • Will pretty-print the JSON if it is valid, or report where the parsing failed.
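A similar check can be sketched inside a script by catching json.JSONDecodeError, which records the line and column where parsing stopped. The malformed text below is a made-up example with an unclosed bracket.

```python
import json

# Hypothetical malformed JSON: the list is never closed with ']'.
bad_text = '{"type": "Point", "coordinates": [42.0, 21.0}'

try:
    json.loads(bad_text)
    message = "valid"
except json.JSONDecodeError as err:
    # lineno and colno give the position at which parsing failed.
    message = "invalid at line {}, column {}: {}".format(
        err.lineno, err.colno, err.msg)

print(message)
```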

HTML / XML