Category Archives: English

How’s the PhD going?

“Don’t ask”

“There is light at the end of the tunnel, but I’m not sure if it’s a train”

“Well, at least it’s not going backwards”

“Let’s talk about something else, shall we?”

All these are perfectly valid responses to the above question. They are more polite than “none of your business” and all contains a grain of truth.

However, in the interest of a few (zero), I thought as well I might share some details here. Mainly motivated by the fact that in a couple of months I’ve managed to get two articles published. Which is kinda a big deal.

So, what’s been published then?

First there is “Micro-tasking as a method for human assessment and quality control in a geospatial data import“, published in “Cartography and Geographic Information Science” (or CaGIS). This article is based on Anne Sofies Master thesis, but has been substantially reworked in order to pass a scientific article. The premise is quite simple: How can microtasking be used to aid in an import of geospatial data to i.e. OpenStreetMap. Or, as the abstract puts it:

Crowd-sourced geospatial data can often be enriched by importing open governmental datasets as long as they are up-to date and of good quality. Unfortunately, merging datasets is not straight forward. In the context of geospatial data, spatial overlaps pose a particular problem, as existing data may be overwritten when a naïve, automated import strategy is employed. For example: OpenStreetMap has imported over 100 open geospatial datasets, but the requirement for human assessment makes this a time-consuming process which requires experienced volunteers or training. In this paper, we propose a hybrid import workflow that combines algorithmic filtering with human assessment using the micro-tasking method. This enables human assessment without the need for complex tools or prior experience. Using an online experiment, we investigated how import speed and accuracy is affected by volunteer experience and partitioning of the micro-task. We conclude that micro-tasking is a viable method for massive quality assessment that does not require volunteers to have prior experience working with geospatial data.

This article is behind the famous scholarly paywall, but If you want to read it we’ll work something out.

What did I learn from this? Well, statistics is hard. And complicated. And KEEP ALL YOUR DATA! And the review process is designed to drain the life our of you.

The second article was published a couple of days ago, in “Journal of Big Data”. Its’s titled “Efficient storage of heterogeneous geospatial data in spatial databases“, and here I am the sole author. The premise? Is NoSQL just a god-damn fad for lazy developers with a fear of database schemas? The conclusion? Pretty much. And PostGIS is cool. Or, in more scholarly terms:

The no-schema approach of NoSQL document stores is a tempting solution for importing heterogenous geospatial data to a spatial database. However, this approach means sacrificing the benefits of RDBMSes, such as existing integrations and the ACID principle. Previous comparisons of the document-store and table-based layout for storing geospatial data favours the document-store approach but does not consider importing data that can be segmented into homogenous datasets. In this paper we propose “The Heterogeneous Open Geodata Storage (HOGS)” system. HOGS is a command line utility that automates the process of importing geospatial data to a PostgreSQL/PostGIS database. It is developed in order to compare the performance of a traditional storage layout adhering to the ACID principle, and a NoSQL-inspired document store. A collection of eight open geospatial datasets comprising 15 million features was imported and queried in order to compare the differences between the two storage layouts. The results from a quantitative experiment are presented and shows that large amounts of open geospatial data can be stored using traditional RDBMSes using a table-based layout without any performance penalties.

This article is by the way Open Access (don’t as how much that cost, just rest ensured that in the end it’s all taxpayer money), so go ahead and read the whole thing if this tickles your fancy. An there is Open Source code as well, available here: github.com/atlefren/HOGS. Some fun facts about this article:

I managed to create a stupid acronym: HOGS
The manuscript was first left in a drawer for five months, before the editor decided it wasn’t fit for the journal

The next journal provided such great reviews as

If you are importing a relatively static dataset such as the toppological dataset of Norway does it really matter if the import takes 1 hr 19 mins vrs 3 hours? It is very likely that this import will only be performed once every couple of months minimum. A DB admin is likely to set this running at night time and return in the morning to an imported dataset.

and

You are submitting your manuscript as “research article” to a journal that only cares about original research and not technical articles or database articles. For this reason, you need to justify the research behind it. The current state of the paper looks like a technical report. Again, an interesting and well-written one, but not a research article.

And the last reviewer (#2, why is it always reviewer #2?) who did not like the fact that I argued with him instead of doing what he said, and whose last comments was that I should add a section: “structure of the paper”. Well, I like the fact that some quality control is applied, but this borders the ridiculous.

Well, so there you have it three articles down (this was the first), at least one to go.

Speaking of. The next article is in the works. I’ve written the code, started writing the article and am gathering data to benchmark the thing. I need versioned geospatial data, and after a while I found out that OpenStreetMap data fits the bill. After some failed attempts using osm2pgsql and FME (which both silently ignore the history), I had to roll my own. Osmium seemed like it could do the trick, but by C++-skills are close to non.existent. Fortunately there is pyosmium, a Python wrapper. After spending a lot of time chasing memory leaks, I found that osmium is _really_ memory-hungry. So, using a cache-file might do the trick. I might do a write-up on this process when (if?) it finishes, but if you’re interested the source code is available on GitHub.

So, yeah. Thats it for now, check back next decade for the next update!

Please seek professional English editing services

Mendeley is dead, long live Zotero!

2 Replies

When I started out on my PhD two years ago I found Mendeley and thought it a perfect reference manager: Free to use, integrated with both MS Word and my browser and a generally easy-to-use GUI. What’s not to like?

Fast-forward two years. One of my papers was rejected and in the process of re-submitting it I needed to re-format the bibliography (more on that frustration in another post). Then Mendeley started acting up: “There was a problem setting up Word plugin communication: The address is protected”. Wtf? I re-installed the Word Plugin, I re-installed Mendeley itself, I tried some hints from this blog, I even watched a couple of YouTube videos. All to no avail. The Mendeley Word plugin did not work!

So I did what I usually do when life is mean to me: I took to Twitter. And complained. The Mendeley team was quick to answer, but their troubleshooting as nothing more than what I already had tried, plus encouraging me to “turn it off and then on again”. Nothing worked. A bit frustrated I replied:

Ok: how do I migrate my data away from Mendeley, and what is the best alternative to Mendeley?

The next day I had no reply and send a more official support request, and was met with this gem:

Dear Customer,

Thank you for submitting your question. This is to confirm that we have received your request and we aim to respond to you within 24 hours.

However, please note our current response time is 5 days.

Ok. Fuck this. I then remember hearing about Zotero, an Open Source reference manager. It seemed to offer both a Word-plugin and browser extension, as well as a method for importing my Mendeley data. Upon installation I chose “import from Mendeley” and found that it was not possible, due to encryption. I then found this site and found yet another reason to migrate away from Mendeley. Luckily my latest backup lacked only 20 items or so, so after 10 minutes of wrangling I had imported all of my data.

And I was impressed: Zotero understood that my Word doc was previously managed by Mendeley, and I did not have to change out all my references and rebuild the bibliography. So, in 30 minutes or so I had a working reference manager again, and I’ve moved from a closed platform incapable of providing adequate support to an open alternative that seems to work great!

So: if you are having trouble at all with Mendeley I would strongly suggest to migrate to Zotero!

Vsts: setting up tests and coverage to run on build for Javascript projects

The SOSI-format: The crazy, Norwegian, geospatial file format

1 Reply

Imagine trying to coordinate the exchange of geospatial data long before the birth of the Shapefile, before XML and JSON was thought of. Around the time when “microcomputers” was _really_ new, and mainframes was the norm. Before I was born.

Despite this (rather sorry) state of affairs, you realize that the “growing use of digital methods used in the production and use of geospatial data raises several coordiantion-issues” [my translation]. In addition, “there is an expressed wish from both software companies and users of geospatial data that new developments does not lead to a chaos of digital information that cannot be used without in-depth knowledge and large investments in software” [my translation].

Pretty forward-thinking if you ask me. Who was thinking about this in 1980? Turns out that two Norwegians, Stein W. Bie and Einar Stormark, did this in 1980, by writing this report.

This report is fantastic. It’s the first hint of a format that Norwegians working with geospatial data (and few others) still has to relate to today. The format, known as the “SOSI-Format” (not to be confused with the SOSI Standard) is a plaintext format for representing points, lines, areas and a bunch of other shapes, in addition to attribute data.

My reaction when I first encountered this format some 8 years ago was “what the hell is this?”, and I started on a crusade to get rid of the format (“there surely are better formats”). But I was hit by a lot of resistance. Partly because I confused the format with the standard, partly because I was young and did not know my history, partly because the format is still in widespread use, and partly because the format is in many ways really cool!

So, I started reading up on the format a bit (and made a parser for it in JavaScript, sosi.js). One thing that struck me was that a lot of things I’ve seen popping up lately has been in the SOSI-format for ages. Shared borders (as in TopoJSON) Check! Local origins (to save space) Check! Complex geometries (like arcs etc) Check!

But, what is it like? It’s a file written in what’s referred to as “dot-notation” (take a look at this file and you’ll understand why). The format was inspired by the british/canadion format FILEMATCH and a french database-system called SIGMI (anyone?).

The format is, as stated, text (i.e. ASCII) based, with the reason was that this ensured that data could be stored and transferred on a wide range of media. At the time of writing the report, there existed FORTRAN-implementations (for both Nord-10/S and UNIVAC 1100) for reading and writing. Nowadays, there exists several closed-source readers and writes for the format (implemented by several Norwegian GIS vendors), in addition to several Open Source readers.

The format is slated for replacement by some GML-variation, but we are still waiting. There is also GDAL/OGR support for the format, courtesy of Statens Kartverk. However, this requires a lot of hoop-jumping and make-magic on Linux. In addition, the current implementation does not work with utf-8, which is a major drawback as most .SOS-files these days are in fact utf-8.

So, there we are. The official Norwegian format for exchange of geographic information in 2018 is a nearly 40 year old plain text format. And the crazy thing is that this Norwegian oddity is actually something other countries are envious about, as we actually have a common, open (!), standard for this purpose, not some de-facto reverse-engineered binary format.

And, why indeed, why should the age of a format be an issue, as long as it works?

Hc Svnt Dracones

Rablerier fra en hurragutt med mye sprøtt i nøtta