Category Archives: PhD

PhDen, en beskrivelse

PhDen jeg tar er en såkalt “NæringslivePhD”. Som en del av dette krever Forskningsrådet at jeg årlig skriver og sender inn en “populærvitenskaplig” oppsummering av prosjektet. Jeg har en følelse av at dette bare havner i en skuff, så da er det kanskje ikke så dumt å legge den her. Det er vel større sjangse for at noen leser den her enn om den ligger i en skuff.

Dog, å skrive oppsummeringer av hva jeg gjør er ikke prioritert over å gjøre det jeg faktisk skal gjøre, så det er en litt sånn haste-skrevet tekst dette. Med det i bakhodet håper jeg kanskje det er oppklarene for de (om noen) som lurer på hva jeg driver med.

Kartdata er en viktig komponent i en rekke tjenester. Byggesak, byutvikling, krisehåndtering, eiendomssalg, og markedsføring er bare noen eksempler. Tilgangen på kartdata er også større enn noen gang. En rekke offentlige instanser kartlegger og frigir sine kartdata under en åpen lisens. Privatpersoner bidrar også med data, både bevisst gjennom dugnadsprosjekter som OpenStreetMap (OSM) og mer ubevisst gjennom bruk av tjenester som Google Maps.

For å bruke disse dataene til å lage nye, digitale tjenester må de samles, forvaltes og gjøres tilgjengelig for utviklere. Dette byr på en rekke utfordringer, knyttet til dataenes kvalitet, nøyaktighet, kompletthet og oppdateringsfrekvens i tillegg til mer tekniske utfordringer som dataformater og datastrukturer.

Jeg jobber med en rekke problemstillinger knyttet til dette. Et aspekt er hvordan vi kan kvalitetssikre kartdata som importeres fra et system til et annet. For eksempel er det en mulighet at de mest detaljerte kartdataene fra det offentlige Norge (FKB) gjøres tilgjengelig under en åpen lisens. I det tilfelle vil det være interessant å importere hus-omriss fra FKB til OSM. Utfordringen er at en god del hus-omriss allerede er kartlagt av frivillige i OSM, og en overskriving av disse vil føre til at data kan gå tapt. En løsning er å manuelt sjekke alle kollisjoner, men dette utgjør om lag 1 million hus-omriss. Vi fant, at ved å splitte denne sjekken opp i små deler, som tilrettelegges og distribueres over internett, såkalt micro-tasking, kan denne jobben gjøres effektivt, selv med deltakere uten erfaring med å jobbe med kartdata.

Et annet aspekt er lagringsstrategier for store mengder kartdata fra forskjellige kilder, med forskjellige formater på atributt-data. Såkalte NoSQL-databaser er en fristende strategi for lagring av disse, da dataene som lagres ikke trenger å bruke samme skjema. Men, disse databasene har også en rekke ulemper. Ved å kjøre et eksperiment på lagring av en stor mengde kartdata i både en NoSQL-databasestruktur og en tradisjonell relasjonell databasestruktur fant jeg at både lagring og uthenting er tregere med en NoSQL-struktur. Dette betyr at med litt mer forarbeid kan man oppnå bedre ytelse og plassutnyttelse ved bruk av relasjonelle databaser, som også gjør bruk av dataene enklere.

Det større spørsmålet dette bygger opp under er hvordan man på en best mulig måte bygger opp et system som henter geografiske data fra en rekke kilder og lagrer det, slik at en kan bygge tjenester på toppen av dette. Her ser jeg på å bruke «event sourcing», et konsept der kun endringer på enkeltgeometrier distribueres når noe faktisk endres. I dag oppdateres de aller fleste geografiske datasett ved at det ved jevnlige intervaller publiseres en oppdatert versjon av datasettet, hvor ofte kun 10-20% av dataene er endret. Ved å se på mekanismer fra kildekontroll og samarbeidsverktøy undersøker jeg hvordan man mest effektivt kan finne endringer og beskrive disse, slik at de kan distribueres til en eller flere konsumenter av dataene. På denne måten har man både en oppdatert representasjon av kartdataene slik de sist var kartlagt, men også en fullstendig historisk oversikt over hva som var kartlagt på et gitt tidspunkt. I tillegg åpner dette for muligheter til å reagere på hendelser når noe faktisk endres.

Målet med dette doktorgradsarbeidet er å skissere et slikt system for distribusjon og bruk av event sourcing som kan settes opp på en skyplattform. Dette inkluderer å utvikle komponentene som inngår, samt skissere utfordringer knyttet til data-integritet og -sikkerhet.

How’s the PhD going?

Leave a reply

“Don’t ask”

“There is light at the end of the tunnel, but I’m not sure if it’s a train”

“Well, at least it’s not going backwards”

“Let’s talk about something else, shall we?”

All these are perfectly valid responses to the above question. They are more polite than “none of your business” and all contains a grain of truth.

However, in the interest of a few (zero), I thought as well I might share some details here. Mainly motivated by the fact that in a couple of months I’ve managed to get two articles published. Which is kinda a big deal.

So, what’s been published then?

First there is “Micro-tasking as a method for human assessment and quality control in a geospatial data import“, published in “Cartography and Geographic Information Science” (or CaGIS). This article is based on Anne Sofies Master thesis, but has been substantially reworked in order to pass a scientific article. The premise is quite simple: How can microtasking be used to aid in an import of geospatial data to i.e. OpenStreetMap. Or, as the abstract puts it:

Crowd-sourced geospatial data can often be enriched by importing open governmental datasets as long as they are up-to date and of good quality. Unfortunately, merging datasets is not straight forward. In the context of geospatial data, spatial overlaps pose a particular problem, as existing data may be overwritten when a naïve, automated import strategy is employed. For example: OpenStreetMap has imported over 100 open geospatial datasets, but the requirement for human assessment makes this a time-consuming process which requires experienced volunteers or training. In this paper, we propose a hybrid import workflow that combines algorithmic filtering with human assessment using the micro-tasking method. This enables human assessment without the need for complex tools or prior experience. Using an online experiment, we investigated how import speed and accuracy is affected by volunteer experience and partitioning of the micro-task. We conclude that micro-tasking is a viable method for massive quality assessment that does not require volunteers to have prior experience working with geospatial data.

This article is behind the famous scholarly paywall, but If you want to read it we’ll work something out.

What did I learn from this? Well, statistics is hard. And complicated. And KEEP ALL YOUR DATA! And the review process is designed to drain the life our of you.

The second article was published a couple of days ago, in “Journal of Big Data”. Its’s titled “Efficient storage of heterogeneous geospatial data in spatial databases“, and here I am the sole author. The premise? Is NoSQL just a god-damn fad for lazy developers with a fear of database schemas? The conclusion? Pretty much. And PostGIS is cool. Or, in more scholarly terms:

The no-schema approach of NoSQL document stores is a tempting solution for importing heterogenous geospatial data to a spatial database. However, this approach means sacrificing the benefits of RDBMSes, such as existing integrations and the ACID principle. Previous comparisons of the document-store and table-based layout for storing geospatial data favours the document-store approach but does not consider importing data that can be segmented into homogenous datasets. In this paper we propose “The Heterogeneous Open Geodata Storage (HOGS)” system. HOGS is a command line utility that automates the process of importing geospatial data to a PostgreSQL/PostGIS database. It is developed in order to compare the performance of a traditional storage layout adhering to the ACID principle, and a NoSQL-inspired document store. A collection of eight open geospatial datasets comprising 15 million features was imported and queried in order to compare the differences between the two storage layouts. The results from a quantitative experiment are presented and shows that large amounts of open geospatial data can be stored using traditional RDBMSes using a table-based layout without any performance penalties.

This article is by the way Open Access (don’t as how much that cost, just rest ensured that in the end it’s all taxpayer money), so go ahead and read the whole thing if this tickles your fancy. An there is Open Source code as well, available here: github.com/atlefren/HOGS. Some fun facts about this article:

I managed to create a stupid acronym: HOGS
The manuscript was first left in a drawer for five months, before the editor decided it wasn’t fit for the journal

The next journal provided such great reviews as

If you are importing a relatively static dataset such as the toppological dataset of Norway does it really matter if the import takes 1 hr 19 mins vrs 3 hours? It is very likely that this import will only be performed once every couple of months minimum. A DB admin is likely to set this running at night time and return in the morning to an imported dataset.

and

You are submitting your manuscript as “research article” to a journal that only cares about original research and not technical articles or database articles. For this reason, you need to justify the research behind it. The current state of the paper looks like a technical report. Again, an interesting and well-written one, but not a research article.

And the last reviewer (#2, why is it always reviewer #2?) who did not like the fact that I argued with him instead of doing what he said, and whose last comments was that I should add a section: “structure of the paper”. Well, I like the fact that some quality control is applied, but this borders the ridiculous.

Well, so there you have it three articles down (this was the first), at least one to go.

Speaking of. The next article is in the works. I’ve written the code, started writing the article and am gathering data to benchmark the thing. I need versioned geospatial data, and after a while I found out that OpenStreetMap data fits the bill. After some failed attempts using osm2pgsql and FME (which both silently ignore the history), I had to roll my own. Osmium seemed like it could do the trick, but by C++-skills are close to non.existent. Fortunately there is pyosmium, a Python wrapper. After spending a lot of time chasing memory leaks, I found that osmium is _really_ memory-hungry. So, using a cache-file might do the trick. I might do a write-up on this process when (if?) it finishes, but if you’re interested the source code is available on GitHub.

So, yeah. Thats it for now, check back next decade for the next update!

Please seek professional English editing services

Mendeley is dead, long live Zotero!

2 Replies

When I started out on my PhD two years ago I found Mendeley and thought it a perfect reference manager: Free to use, integrated with both MS Word and my browser and a generally easy-to-use GUI. What’s not to like?

Fast-forward two years. One of my papers was rejected and in the process of re-submitting it I needed to re-format the bibliography (more on that frustration in another post). Then Mendeley started acting up: “There was a problem setting up Word plugin communication: The address is protected”. Wtf? I re-installed the Word Plugin, I re-installed Mendeley itself, I tried some hints from this blog, I even watched a couple of YouTube videos. All to no avail. The Mendeley Word plugin did not work!

So I did what I usually do when life is mean to me: I took to Twitter. And complained. The Mendeley team was quick to answer, but their troubleshooting as nothing more than what I already had tried, plus encouraging me to “turn it off and then on again”. Nothing worked. A bit frustrated I replied:

Ok: how do I migrate my data away from Mendeley, and what is the best alternative to Mendeley?

The next day I had no reply and send a more official support request, and was met with this gem:

Dear Customer,

Thank you for submitting your question. This is to confirm that we have received your request and we aim to respond to you within 24 hours.

However, please note our current response time is 5 days.

Ok. Fuck this. I then remember hearing about Zotero, an Open Source reference manager. It seemed to offer both a Word-plugin and browser extension, as well as a method for importing my Mendeley data. Upon installation I chose “import from Mendeley” and found that it was not possible, due to encryption. I then found this site and found yet another reason to migrate away from Mendeley. Luckily my latest backup lacked only 20 items or so, so after 10 minutes of wrangling I had imported all of my data.

And I was impressed: Zotero understood that my Word doc was previously managed by Mendeley, and I did not have to change out all my references and rebuild the bibliography. So, in 30 minutes or so I had a working reference manager again, and I’ve moved from a closed platform incapable of providing adequate support to an open alternative that seems to work great!

So: if you are having trouble at all with Mendeley I would strongly suggest to migrate to Zotero!

The Open Geospatial Data Ecosystem

1 Reply

This summer my first peer-reviewed article, “The Open Geospatial Data Ecosystem”, was published in “Kart og plan”. Unfortunately, the journal is not that digital, and they decided to withhold the issue from the web for a year, “in order to protect the printed version”. What?!

However, I was provided a link to a pdf of my article, and told I could distribute it. I interpret this as an approval of me publishing the article on my blog, so that is exactly what I’ll do.

The full article can be downloaded here: http://docs.atlefren.net/ogde.pdf, and the abstract is provided here:

Open Governmental Data, Linked Open Data, Open Government, Volunteered Geographic Information, Participatory GIS, and Free and Open Source Software are all parts of The Open Geospatial Data Ecosystem. How do these data types shape what we define as Open Geospatial Data; Open Data of a geospatial nature? While all these areas are well described in the literature, there is a lack of a formal definition and exploration of the concept of Open Geospatial Data as a whole. A review of current research, case-studies, and real-world examples, such as OpenStreetMap, reveal some common features; governments are a large source of open data due to their historical role and as a result of political pressure on making data public, and the large role volunteers play both in collecting and managing open data and in developing open source tools. This article provides a common base for discussion. Open Geospatial data will be even more important as it matures and more governments and corporations release and use open data.

Hc Svnt Dracones

Rablerier fra en hurragutt med mye sprøtt i nøtta