
The PhD, a description

The PhD I’m doing is a so-called “Industrial PhD” (“Nærings-PhD”). As part of this, the Research Council of Norway requires that I write and submit a yearly “popular science” summary of the project. I have a feeling these summaries just end up in a drawer, so it can’t hurt to post it here as well. The odds of someone reading it are probably better here than in that drawer.

That said, writing summaries of what I do is not prioritized over actually doing what I’m supposed to do, so this is a somewhat hastily written text. With that in mind, I hope it clarifies things for those (if any) who wonder what I’m up to.

Map data is an important component of a wide range of services. Building permits, urban development, emergency management, real estate sales and marketing are just a few examples. Access to map data is also greater than ever. A number of public agencies map and release their data under open licenses. Private individuals contribute as well, both deliberately through crowdsourcing projects such as OpenStreetMap (OSM) and more unknowingly through their use of services like Google Maps.

To use this data to build new digital services, it has to be collected, managed and made available to developers. This poses a number of challenges related to data quality, accuracy, completeness and update frequency, in addition to more technical challenges such as data formats and data structures.

I work on a number of problems related to this. One aspect is how we can quality-assure map data that is imported from one system into another. For example, there is a possibility that the most detailed public map data in Norway (FKB) will be released under an open license. In that case, it would be interesting to import building footprints from FKB into OSM. The challenge is that a good number of building footprints have already been mapped by volunteers in OSM, and overwriting these could mean that data is lost. One solution is to manually check every collision, but these amount to roughly one million building footprints. We found that by splitting this check into small parts that are prepared and distributed over the internet, so-called micro-tasking, the job can be done efficiently, even by participants with no prior experience working with map data.

Another aspect is storage strategies for large amounts of map data from different sources, with different formats for their attribute data. So-called NoSQL databases are a tempting option here, since the stored data does not have to conform to a common schema. However, these databases also come with a number of drawbacks. By running an experiment storing a large amount of map data in both a NoSQL database structure and a traditional relational database structure, I found that both storage and retrieval are slower with the NoSQL structure. In other words, with a bit more work up front you get better performance and space utilization from a relational database, which also makes the data easier to use.

The bigger question all of this feeds into is how to best build a system that collects geographic data from a number of sources and stores it, so that services can be built on top. Here I am looking at “event sourcing”, a concept where only the changes to individual geometries are distributed, and only when something actually changes. Today, most geographic datasets are updated by publishing a new version of the entire dataset at regular intervals, even though often only 10-20% of the data has changed. Drawing on mechanisms from version control and collaboration tools, I am investigating how to most efficiently detect and describe these changes, so that they can be distributed to one or more consumers of the data. This gives you both an up-to-date representation of the map data as it was last mapped and a complete historical record of what was mapped at any given point in time. In addition, it opens up the possibility of reacting to events when something actually changes.
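
For the technically inclined, here is a toy sketch of what a single change event and a consumer of it could look like. This is purely illustrative; the class and field names are mine and do not describe any finished system.

from dataclasses import dataclass

@dataclass
class GeometryChangeEvent:
    # One change to one feature, instead of a dump of the whole dataset
    dataset: str       # e.g. 'buildings'
    feature_id: str    # stable id of the geometry that changed
    operation: str     # 'create', 'modify' or 'delete'
    version: int       # increases by one for every change to the feature
    geometry_wkt: str  # the new geometry, empty for deletes


def apply_event(current_state, event):
    # Consumers keep an up-to-date view by replaying events in order,
    # while the event log itself is the complete history
    if event.operation == 'delete':
        current_state.pop(event.feature_id, None)
    else:
        current_state[event.feature_id] = event
    return current_state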

The goal of this doctoral work is to outline such a system for distribution and use of event sourcing that can be set up on a cloud platform. This includes developing the components involved, as well as outlining challenges related to data integrity and security.

Efficient PostgreSQL/PostGIS imports from Python using copy statements

For some reason I’ve had to import fairly large chunks of data into PostgreSQL/PostGIS using Python lately. When I wrote the HOGS app I decided that COPY statements were the way to go, as this is a fast and reliable way of getting large chunks of data into PostgreSQL.

But, as you might know, “COPY moves data between PostgreSQL tables and standard file-system files. ” [1]. Ugh, those pesky files. My data isn’t in files and shouldn’t be.

So I started googling and found this gist. That seemed to do the trick: works with Python, uses psycopg2, and eliminates the need for a file, all while letting me use COPY statements. All good?
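
The core of the idea is that psycopg2 lets you feed COPY ... FROM STDIN from any file-like object, so an in-memory buffer works just as well as a file on disk. Here is a minimal sketch of that idea; the table and connection details are made up, and the gist itself wraps an iterator lazily instead of building the whole buffer up front:

import io
import psycopg2

conn = psycopg2.connect('dbname=somedb')  # hypothetical connection

rows = [(1, 'foo'), (2, 'bar')]
buf = io.StringIO(''.join('%s\t%s\n' % row for row in rows))

with conn.cursor() as cur:
    # COPY ... FROM STDIN reads from the file-like object, no file on disk needed
    cur.copy_expert('COPY my_table (id, name) FROM STDIN', buf)
conn.commit()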

Well, almost. PostGIS is a bit picky: you need to provide the geometries as WKB, but the WKB standard does not specify an srid. So, if your table has a geometry column defined with SRID 4326, you’ll get an error. The solution? Use EWKB (the E is for extended). A tool such as pygeos lets you add an srid to a WKB-encoded geometry like this:

from pygeos import to_wkb, from_wkb, set_srid

def add_srid(wkb, srid=4326):
    # Decode the WKB, stamp the srid onto the geometry and re-encode it
    # as hex-encoded EWKB with the srid included
    return to_wkb(
        set_srid(from_wkb(wkb), srid),
        hex=True,
        include_srid=True
    )

Another issue worth considering when importing, say, several million geometries, is that you probably want some commits in between. Mucking about a bit with generators, I found a way to split a generator into a generator that yields generators, so that you can iterate over, say, 10 000 elements and then take a break to commit. Something like this:

def split_generator(gen, elements_pr_generator):
    # Split a generator into a generator of generators, each yielding at
    # most elements_pr_generator items from the underlying generator
    is_running = True

    def chunk():
        nonlocal is_running
        for _ in range(elements_pr_generator):
            try:
                yield next(gen)
            except StopIteration:
                # Re-raising StopIteration inside a generator is an error in
                # Python 3.7+ (PEP 479), so just stop when gen is exhausted
                is_running = False
                return

    while is_running:
        yield chunk()
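
Combined with the COPY approach above (reusing the conn from that sketch, with made-up table and column names), usage then looks something like this:

import io

records = ({'id': i, 'name': 'feature-%d' % i} for i in range(1000000))

for part in split_generator(records, 10000):
    buf = io.StringIO(''.join('%s\t%s\n' % (r['id'], r['name']) for r in part))
    with conn.cursor() as cur:
        cur.copy_expert('COPY my_table (id, name) FROM STDIN', buf)
    conn.commit()  # one commit per 10 000 rows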

One last thing you might end up getting bitten by is the fact that the COPY statement expects a file, and a file consists of text. So if you are inserting, say, a dictionary into a json(b) column, you might end up with errors. Same for None values. In order to handle this, I wrote a line formatter that takes a record from your generator and transforms it into a proper line:

def format_line(record, column_data):
    # One '%s' placeholder per column, tab-separated as COPY expects
    line_template = '\t'.join(['%s'] * len(column_data))
    data = []
    for column in column_data:
        key = column['key']
        value = None

        # get the value from the record, if present
        if key in record:
            value = record[key]

        if value is None:
            # \N is the NULL representation in COPY's text format
            data.append('\\N')
        elif 'encoder' in column:
            # optional per-column encoder, e.g. json.dumps for json(b) columns
            data.append(column['encoder'](value))
        else:
            data.append(value)

    return line_template % tuple(data)

This function takes a definition for each column, pulls the corresponding value from the record, and formats it as a proper line to be fed to the COPY statement.
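
For illustration, a column definition for a table with an id, a json(b) properties column and a geometry column could look like this (the column names are made up, and add_srid is the helper from earlier):

import json

column_data = [
    {'key': 'id'},
    {'key': 'properties', 'encoder': json.dumps},  # dict -> text for a json(b) column
    {'key': 'geom', 'encoder': add_srid},          # raw WKB -> hex EWKB
]

print(format_line({'id': 1, 'properties': {'name': 'foo'}}, column_data))
# prints the id, the JSON-encoded dict and \N (the missing geometry), tab-separated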

If you combine these elements you end up with a small utility that lets you copy most Python data into PostgreSQL with ease. So, if this is something you need, check out pg_geomcopy on GitHub!

How’s the PhD going?

“Don’t ask”

“There is light at the end of the tunnel, but I’m not sure if it’s a train”

“Well, at least it’s not going backwards”

“Let’s talk about something else, shall we?”

All of these are perfectly valid responses to the above question. They are more polite than “none of your business” and they all contain a grain of truth.

However, in the interest of the few (zero) who care, I thought I might as well share some details here. Mainly motivated by the fact that in the span of a couple of months I’ve managed to get two articles published. Which is kind of a big deal.

So, what’s been published then?

First there is “Micro-tasking as a method for human assessment and quality control in a geospatial data import“, published in “Cartography and Geographic Information Science” (or CaGIS). This article is based on Anne Sofie’s Master’s thesis, but has been substantially reworked in order to pass as a scientific article. The premise is quite simple: how can micro-tasking be used to aid an import of geospatial data into, for example, OpenStreetMap? Or, as the abstract puts it:

Crowd-sourced geospatial data can often be enriched by importing open governmental datasets as long as they are up-to date and of good quality. Unfortunately, merging datasets is not straight forward. In the context of geospatial data, spatial overlaps pose a particular problem, as existing data may be overwritten when a naïve, automated import strategy is employed. For example: OpenStreetMap has imported over 100 open geospatial datasets, but the requirement for human assessment makes this a time-consuming process which requires experienced volunteers or training. In this paper, we propose a hybrid import workflow that combines algorithmic filtering with human assessment using the micro-tasking method. This enables human assessment without the need for complex tools or prior experience. Using an online experiment, we investigated how import speed and accuracy is affected by volunteer experience and partitioning of the micro-task. We conclude that micro-tasking is a viable method for massive quality assessment that does not require volunteers to have prior experience working with geospatial data.

This article is behind the famous scholarly paywall, but if you want to read it we’ll work something out.

What did I learn from this? Well, statistics is hard. And complicated. And KEEP ALL YOUR DATA! And the review process is designed to drain the life out of you.

The second article was published a couple of days ago, in “Journal of Big Data”. It’s titled “Efficient storage of heterogeneous geospatial data in spatial databases“, and here I am the sole author. The premise? Is NoSQL just a god-damn fad for lazy developers with a fear of database schemas? The conclusion? Pretty much. And PostGIS is cool. Or, in more scholarly terms:

The no-schema approach of NoSQL document stores is a tempting solution for importing heterogenous geospatial data to a spatial database. However, this approach means sacrificing the benefits of RDBMSes, such as existing integrations and the ACID principle. Previous comparisons of the document-store and table-based layout for storing geospatial data favours the document-store approach but does not consider importing data that can be segmented into homogenous datasets. In this paper we propose “The Heterogeneous Open Geodata Storage (HOGS)” system. HOGS is a command line utility that automates the process of importing geospatial data to a PostgreSQL/PostGIS database. It is developed in order to compare the performance of a traditional storage layout adhering to the ACID principle, and a NoSQL-inspired document store. A collection of eight open geospatial datasets comprising 15 million features was imported and queried in order to compare the differences between the two storage layouts. The results from a quantitative experiment are presented and shows that large amounts of open geospatial data can be stored using traditional RDBMSes using a table-based layout without any performance penalties.

This article is, by the way, Open Access (don’t ask how much that cost, just rest assured that in the end it’s all taxpayer money), so go ahead and read the whole thing if this tickles your fancy. And there is Open Source code as well, available here: github.com/atlefren/HOGS. Some fun facts about this article:

  • I managed to create a stupid acronym: HOGS
  • The manuscript was first left in a drawer for five months, before the editor decided it wasn’t fit for the journal

The next journal provided such great reviews as

If you are importing a relatively static dataset such as the toppological dataset of Norway does it really matter if the import takes 1 hr 19 mins vrs 3 hours? It is very likely that this import will only be performed once every couple of months minimum. A DB admin is likely to set this running at night time and return in the morning to an imported dataset.

and

You are submitting your manuscript as “research article” to a journal that only cares about original research and not technical articles or database articles. For this reason, you need to justify the research behind it. The current state of the paper looks like a technical report. Again, an interesting and well-written one, but not a research article.

And then there was the last reviewer (#2, why is it always reviewer #2?), who did not like the fact that I argued with him instead of doing what he said, and whose last comment was that I should add a section called “structure of the paper”. Well, I like the fact that some quality control is applied, but this borders on the ridiculous.

Well, there you have it: three articles down (this was the first), at least one to go.

Speaking of which: the next article is in the works. I’ve written the code, started writing the article and am gathering data to benchmark the thing. I need versioned geospatial data, and after a while I found out that OpenStreetMap data fits the bill. After some failed attempts using osm2pgsql and FME (which both silently ignore the history), I had to roll my own. Osmium seemed like it could do the trick, but my C++ skills are close to non-existent. Fortunately there is pyosmium, a Python wrapper. After spending a lot of time chasing memory leaks, I found that osmium is _really_ memory-hungry, so using a cache file might do the trick. I might do a write-up on this process when (if?) it finishes, but if you’re interested the source code is available on GitHub.
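
For the curious, the gist of the pyosmium approach is a handler class plus an on-disk node cache, roughly like this. This is a sketch from memory, not the actual code linked above; the index string and file names are assumptions, so check the pyosmium docs before copying it.

import osmium  # pyosmium

class WayVersionHandler(osmium.SimpleHandler):
    # Collects the highest seen version number per way from a history file
    def __init__(self):
        osmium.SimpleHandler.__init__(self)
        self.versions = {}

    def way(self, w):
        self.versions[w.id] = max(self.versions.get(w.id, 0), w.version)

handler = WayVersionHandler()
# locations=True resolves node coordinates for ways; keeping that index in a
# file ('dense_file_array') instead of in memory is what kept osmium from
# eating all available RAM for me
handler.apply_file('history.osh.pbf', locations=True,
                   idx='dense_file_array,node-cache.bin')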

So, yeah. That’s it for now, check back next decade for the next update!