
How’s the PhD going?

“Don’t ask”

“There is light at the end of the tunnel, but I’m not sure if it’s a train”

“Well, at least it’s not going backwards”

“Let’s talk about something else, shall we?”

All these are perfectly valid responses to the above question. They are more polite than “none of your business” and all contain a grain of truth.

However, in the interest of a few (zero), I thought I might as well share some details here, mainly motivated by the fact that in a couple of months I’ve managed to get two articles published. Which is kinda a big deal.

So, what’s been published then?

First there is “Micro-tasking as a method for human assessment and quality control in a geospatial data import”, published in “Cartography and Geographic Information Science” (or CaGIS). This article is based on Anne Sofie’s Master’s thesis, but has been substantially reworked in order to pass as a scientific article. The premise is quite simple: How can micro-tasking be used to aid in an import of geospatial data to e.g. OpenStreetMap? Or, as the abstract puts it:

Crowd-sourced geospatial data can often be enriched by importing open governmental datasets as long as they are up-to date and of good quality. Unfortunately, merging datasets is not straight forward. In the context of geospatial data, spatial overlaps pose a particular problem, as existing data may be overwritten when a naïve, automated import strategy is employed. For example: OpenStreetMap has imported over 100 open geospatial datasets, but the requirement for human assessment makes this a time-consuming process which requires experienced volunteers or training. In this paper, we propose a hybrid import workflow that combines algorithmic filtering with human assessment using the micro-tasking method. This enables human assessment without the need for complex tools or prior experience. Using an online experiment, we investigated how import speed and accuracy is affected by volunteer experience and partitioning of the micro-task. We conclude that micro-tasking is a viable method for massive quality assessment that does not require volunteers to have prior experience working with geospatial data.

This article is behind the famous scholarly paywall, but if you want to read it we’ll work something out.

What did I learn from this? Well, statistics is hard. And complicated. And KEEP ALL YOUR DATA! And the review process is designed to drain the life out of you.

The second article was published a couple of days ago, in “Journal of Big Data”. It’s titled “Efficient storage of heterogeneous geospatial data in spatial databases”, and here I am the sole author. The premise? Is NoSQL just a god-damn fad for lazy developers with a fear of database schemas? The conclusion? Pretty much. And PostGIS is cool. Or, in more scholarly terms:

The no-schema approach of NoSQL document stores is a tempting solution for importing heterogenous geospatial data to a spatial database. However, this approach means sacrificing the benefits of RDBMSes, such as existing integrations and the ACID principle. Previous comparisons of the document-store and table-based layout for storing geospatial data favours the document-store approach but does not consider importing data that can be segmented into homogenous datasets. In this paper we propose “The Heterogeneous Open Geodata Storage (HOGS)” system. HOGS is a command line utility that automates the process of importing geospatial data to a PostgreSQL/PostGIS database. It is developed in order to compare the performance of a traditional storage layout adhering to the ACID principle, and a NoSQL-inspired document store. A collection of eight open geospatial datasets comprising 15 million features was imported and queried in order to compare the differences between the two storage layouts. The results from a quantitative experiment are presented and shows that large amounts of open geospatial data can be stored using traditional RDBMSes using a table-based layout without any performance penalties.

This article is, by the way, Open Access (don’t ask how much that cost, just rest assured that in the end it’s all taxpayer money), so go ahead and read the whole thing if this tickles your fancy. And there is Open Source code as well, available here: github.com/atlefren/HOGS. Some fun facts about this article:

  • I managed to create a stupid acronym: HOGS
  • The manuscript was first left in a drawer for five months, before the editor decided it wasn’t fit for the journal

The next journal provided such great reviews as

If you are importing a relatively static dataset such as the toppological dataset of Norway does it really matter if the import takes 1 hr 19 mins vrs 3 hours? It is very likely that this import will only be performed once every couple of months minimum. A DB admin is likely to set this running at night time and return in the morning to an imported dataset.

and

You are submitting your manuscript as “research article” to a journal that only cares about original research and not technical articles or database articles. For this reason, you need to justify the research behind it. The current state of the paper looks like a technical report. Again, an interesting and well-written one, but not a research article.

And the last reviewer (#2, why is it always reviewer #2?) who did not like the fact that I argued with him instead of doing what he said, and whose last comment was that I should add a section: “structure of the paper”. Well, I like the fact that some quality control is applied, but this borders on the ridiculous.

Well, so there you have it: three articles down (this was the first), at least one to go.

Speaking of which: the next article is in the works. I’ve written the code, started writing the article and am gathering data to benchmark the thing. I need versioned geospatial data, and after a while I found out that OpenStreetMap data fits the bill. After some failed attempts using osm2pgsql and FME (which both silently ignore the history), I had to roll my own. Osmium seemed like it could do the trick, but my C++ skills are close to non-existent. Fortunately there is pyosmium, a Python wrapper. After spending a lot of time chasing memory leaks, I found that osmium is _really_ memory-hungry. So, using a cache file might do the trick. I might do a write-up on this process when (if?) it finishes, but if you’re interested the source code is available on GitHub.

So, yeah. That’s it for now, check back next decade for the next update!

 

Please seek professional English editing services

Today I heard back from the journal we submitted our paper to a couple of months ago. The article was rejected, which is to be expected I suppose. At least that is what everyone tells me: expect rejections.

In all fairness I was also aware that peer-reviewers can be overly harsh and leave you feeling that your work is of no value whatsoever. Just take a look here if you don’t believe me.

However, what struck me about this rejection was a single sentence from the editor. I’ll provide it here in context:

In addition, the English needs improvements as the reviewers expressed difficulties of understanding or unclear statements throughout the manuscript. Please seek professional English editing services. You are welcome to use any editing service of your liking. If needed, professional English editing services are available at [redacted].

Ok, so my English is now so bad that I need “professional help”? Granted, I am not a native English speaker, but I still feel that I’m able to write rather understandable English. I know I should not be the judge of that myself, and according to the editor the “reviewers expressed difficulties of understanding or unclear statements throughout the manuscript”. Did they, now?

Here is what reviewer 1 has to say: “Finally, there are some grammar issues to fix.”

Reviewer 2 takes a page-by-page, line-by-line approach, noting ~5 occurrences of spelling errors, and saying that the text is “unclear” in a couple of places.

Reviewer 3 has no comments regarding the language at all.

So, in light of this: One of three reviewers found several spelling errors and had trouble understanding some parts. This reviewer was also the most critical in other aspects, suggesting that the paper should be rewritten completely, as she/he does not agree with our main idea. Fair enough. But to extrapolate from that to say that the text is difficult to understand and that I need professional help? Chill the fuck down, mr. editor!

What I fail to understand is why feedback like this should warrant the statement: “Please seek professional English editing services”. To me this phrase sounds like: “you do not know how to write in English, do something about it!!”.

Well, maybe I’m just a bit angry and disappointed that the paper was rejected, but this just affirms my views on academic publishing. A complete lack of empathy and understanding and the idea that “since somebody once gave me harsh critique I need to be harsh as well”. Well, fuck that shit. By all means, point out my errors, encourage me to re-write and re-phrase myself, but don’t be such a fucking dick about it.

And, yeah, I don’t discard the idea that this “suggestion” is based on the fact that the publishing house probably makes some money on referring me to a “professional English editing service”.

Mendeley is dead, long live Zotero!

When I started out on my PhD two years ago I found Mendeley and thought it a perfect reference manager: Free to use, integrated with both MS Word and my browser and a generally easy-to-use GUI. What’s not to like?

Fast-forward two years. One of my papers was rejected and in the process of re-submitting it I needed to re-format the bibliography (more on that frustration in another post). Then Mendeley started acting up: “There was a problem setting up Word plugin communication: The address is protected”. Wtf? I re-installed the Word Plugin, I re-installed Mendeley itself, I tried some hints from this blog, I even watched a couple of YouTube videos. All to no avail. The Mendeley Word plugin did not work!

So I did what I usually do when life is mean to me: I took to Twitter. And complained. The Mendeley team was quick to answer, but their troubleshooting was nothing more than what I had already tried, plus encouraging me to “turn it off and then on again”. Nothing worked. A bit frustrated, I replied:

Ok: how do I migrate my data away from Mendeley, and what is the best alternative to Mendeley?

The next day I had no reply, so I sent a more official support request, and was met with this gem:

Dear Customer,

Thank you for submitting your question. This is to confirm that we have received your request and we aim to respond to you within 24 hours.

However, please note our current response time is 5 days.

Ok. Fuck this. I then remembered hearing about Zotero, an Open Source reference manager. It seemed to offer both a Word plugin and a browser extension, as well as a method for importing my Mendeley data. Upon installation I chose “import from Mendeley” and found that it was not possible, due to encryption. I then found this site and found yet another reason to migrate away from Mendeley. Luckily my latest backup lacked only 20 items or so, so after 10 minutes of wrangling I had imported all of my data.

And I was impressed: Zotero understood that my Word doc was previously managed by Mendeley, and I did not have to change out all my references and rebuild the bibliography. So, in 30 minutes or so I had a working reference manager again, and I’ve moved from a closed platform incapable of providing adequate support to an open alternative that seems to work great!

So: if you are having any trouble at all with Mendeley, I would strongly suggest migrating to Zotero!

VSTS: setting up tests and coverage to run on build for JavaScript projects

I’m currently writing JavaScript code, React and Redux to be more specific. After picking up the brilliant book “Human Redux” I’ve really started to enjoy this ecosystem.

But, remembering the brilliant Zombie TDD I also want to get back in the testing-game. This is quite easy when using create-react-app, as Jest is a great tool.

However, this is not the topic of this post. The topic here is how to get Microsoft VSTS (Visual Studio Team Services) to run your tests during the build phase, and report test results and coverage, and provide you with stuff like this:

[Screenshots: test results and code coverage as shown on the VSTS build page]

You need to do stuff both to your project, and to your vsts build definition.

First off, your project:

You need the jest-junit package

npm install jest-junit -S

And, you need to edit your package.json file. First, add the top-level entry “jest”, with the following content

"jest": {
    "coverageReporters": [
      "cobertura",
      "html"
    ]
  },

and the top-level entry “jest-junit”, with the following content

"jest-junit": {
    "suiteName": "jest tests",
    "output": "test/junit.xml",
    "classNameTemplate": "{classname} - {title}",
    "titleTemplate": "{classname} - {title}",
    "ancestorSeparator": " > ",
    "usePathForSuiteName": "true"
},

Finally, you need to add the task “test:ci” to your script-block:

"test:ci": "react-scripts test --env=jsdom --testResultsProcessor=\"jest-junit\" --coverage",

So, what have we done here?

  1. We set the coverage reporters of Istanbul (which jest uses) to write both the cobertura and html formats
  2. We set up jest-junit to produce junit-xml from our tests
  3. We create a test-task to be run on vsts that uses these two
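
If you want to double-check that the three pieces above actually ended up in package.json, a small script along these lines could do it. This is only a sketch: findMissingCiConfig and examplePackage are names I made up for the illustration, they are not part of jest, jest-junit or create-react-app, and the checks simply mirror the entries listed in this post.

```javascript
// Hypothetical helper (not part of jest-junit or create-react-app):
// reports which of the package.json entries described above are missing.
function findMissingCiConfig(pkg) {
  const problems = [];

  const reporters = (pkg.jest && pkg.jest.coverageReporters) || [];
  if (!reporters.includes("cobertura")) {
    problems.push('jest.coverageReporters should include "cobertura"');
  }

  if (!pkg["jest-junit"] || !pkg["jest-junit"].output) {
    problems.push('"jest-junit" should define an "output" path');
  }

  const ciScript = (pkg.scripts && pkg.scripts["test:ci"]) || "";
  if (!ciScript.includes("--coverage")) {
    problems.push('scripts["test:ci"] should pass --coverage');
  }
  if (!ciScript.includes("jest-junit")) {
    problems.push('scripts["test:ci"] should use jest-junit as results processor');
  }

  return problems;
}

// Example input mirroring the entries from this post.
const examplePackage = {
  jest: { coverageReporters: ["cobertura", "html"] },
  "jest-junit": { suiteName: "jest tests", output: "test/junit.xml" },
  scripts: {
    "test:ci":
      'react-scripts test --env=jsdom --testResultsProcessor="jest-junit" --coverage',
  },
};

console.log(findMissingCiConfig(examplePackage)); // prints []
```

In a real project you would pass in require("./package.json") instead of the inline example object.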

You also want to add the resulting files to .gitignore

# testing
/coverage
/test

This should now work locally, test it by running

CI=true npm run test:ci

Both the coverage folder and the file test/junit.xml should now have been created.

So, everything is good on the project side, time to move on to VSTS.
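
If you would rather let a script check for the output than look in the folders yourself, something like this Node sketch might help. Note the assumptions: findMissingArtifacts is a made-up helper, test/junit.xml comes from the jest-junit config above, and coverage/cobertura-coverage.xml is the file name the cobertura reporter normally writes (adjust it if yours ends up somewhere else).

```javascript
const fs = require("fs");
const path = require("path");

// Assumed output locations, see the note above.
const EXPECTED_ARTIFACTS = [
  path.join("test", "junit.xml"),
  path.join("coverage", "cobertura-coverage.xml"),
];

// Returns the expected files that do not exist below projectRoot.
function findMissingArtifacts(projectRoot, artifacts = EXPECTED_ARTIFACTS) {
  return artifacts.filter(
    (relativePath) => !fs.existsSync(path.join(projectRoot, relativePath))
  );
}

// Check the directory the script is run from.
const missing = findMissingArtifacts(process.cwd());
if (missing.length > 0) {
  console.log("Missing build artifacts: " + missing.join(", "));
} else {
  console.log("All expected build artifacts are present.");
}
```

Run it from the project root right after the test:ci task has finished.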

Create a build, and add three tasks:

The first is an “npm” task, configured like this:

[Screenshot: npm task configuration]

Then, you need to publish the results of the test and the coverage, so add a “Publish Test Results” and a “Publish Code Coverage Results” task:

[Screenshot: “Publish Test Results” task configuration]

[Screenshot: “Publish Code Coverage Results” task configuration]

Make sure you select “Even if a previous task has failed, unless the build was canceled” on the option “Run this task” under the tab “Control Options” for both publish tasks, as we want test reports and coverage even if we have failing tests.

In addition you want to set the environment variable CI to true, in order for Jest to run all tests:

[Screenshot: the variable CI set to true in the build definition]

With these things in place your build should now include test results and coverage reports!

The SOSI-format: The crazy, Norwegian, geospatial file format

Imagine trying to coordinate the exchange of geospatial data long before the birth of the Shapefile, before XML and JSON were thought of. Around the time when “microcomputers” were _really_ new, and mainframes were the norm. Before I was born.

Despite this (rather sorry) state of affairs, you realize that the “growing use of digital methods used in the production and use of geospatial data raises several coordination issues” [my translation]. In addition, “there is an expressed wish from both software companies and users of geospatial data that new developments do not lead to a chaos of digital information that cannot be used without in-depth knowledge and large investments in software” [my translation].

Pretty forward-thinking if you ask me. Who was thinking about this in 1980? Turns out that two Norwegians, Stein W. Bie and Einar Stormark, were, and they wrote this report.

This report is fantastic. It’s the first hint of a format that Norwegians working with geospatial data (and few others) still have to relate to today. The format, known as the “SOSI-Format” (not to be confused with the SOSI Standard) is a plaintext format for representing points, lines, areas and a bunch of other shapes, in addition to attribute data.

My reaction when I first encountered this format some 8 years ago was “what the hell is this?”, and I started on a crusade to get rid of the format (“there surely are better formats”). But I was hit by a lot of resistance. Partly because I confused the format with the standard, partly because I was young and did not know my history, partly because the format is still in widespread use, and partly because the format is in many ways really cool!

So, I started reading up on the format a bit (and made a parser for it in JavaScript, sosi.js). One thing that struck me was that a lot of things I’ve seen popping up lately have been in the SOSI-format for ages. Shared borders (as in TopoJSON)? Check! Local origins (to save space)? Check! Complex geometries (like arcs etc.)? Check!
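
To give an idea of how the local-origin trick saves space, here is a tiny sketch. It assumes, as I understand the format, that the file header states an origin and a unit, and that each coordinate is stored as a small integer offset, so that the real value is origin + stored value × unit. The function name and all the numbers are made up for the illustration.

```javascript
// Simplified sketch: real coordinate = origin + stored integer * unit.
function decodeCoordinate(stored, origin, unit) {
  return {
    north: origin.north + stored.north * unit,
    east: origin.east + stored.east * unit,
  };
}

// Made-up header values: an origin in metres and a unit of one centimetre.
const header = { origin: { north: 7000000, east: 500000 }, unit: 0.01 };

// What would be written in the file: short integers instead of long decimals.
const storedPoint = { north: 12345, east: 67890 };

const decoded = decodeCoordinate(storedPoint, header.origin, header.unit);
console.log(decoded.north.toFixed(2), decoded.east.toFixed(2));
// prints: 7000123.45 500678.90
```

Every point in the file only needs the short offsets, while the long, shared part of the coordinate is written once in the header.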

But, what is it like? It’s a file written in what’s referred to as “dot-notation” (take a look at this file and you’ll understand why). The format was inspired by the British/Canadian format FILEMATCH and a French database system called SIGMI (anyone?).

The format is, as stated, text-based (i.e. ASCII), the reason being that this ensured that data could be stored and transferred on a wide range of media. At the time of writing the report, there existed FORTRAN implementations (for both Nord-10/S and UNIVAC 1100) for reading and writing. Nowadays, there exist several closed-source readers and writers for the format (implemented by several Norwegian GIS vendors), in addition to several Open Source readers.

The format is slated for replacement by some GML variation, but we are still waiting. There is also GDAL/OGR support for the format, courtesy of Statens Kartverk. However, this requires a lot of hoop-jumping and make-magic on Linux. In addition, the current implementation does not work with UTF-8, which is a major drawback as most .SOS-files these days are in fact UTF-8.

So, there we are. The official Norwegian format for exchange of geographic information in 2018 is a nearly 40-year-old plain text format. And the crazy thing is that this Norwegian oddity is actually something other countries are envious of, as we actually have a common, open (!), standard for this purpose, not some de-facto reverse-engineered binary format.

And indeed, why should the age of a format be an issue, as long as it works?
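
For those who don’t want to open a file, here is a rough sketch of how the dot-notation can be read: the number of leading dots gives the nesting level, and lines without dots (such as coordinates) belong to the element above them. Mind the assumptions: the sample is hand-written by me and not taken from a real dataset, parseDotNotation is a made-up function and not the API of sosi.js, and it ignores everything a real parser must handle (comments, quoting, character sets and so on).

```javascript
// Hand-written sample in the spirit of a SOSI file; not from a real dataset.
const sample = [
  ".HODE",
  "..TEGNSETT UTF-8",
  "..TRANSPAR",
  "...ENHET 0.01",
  ".PUNKT 1:",
  "..OBJTYPE Fastmerke",
  "..NØ",
  "12345 67890",
  ".SLUTT",
].join("\n");

// Builds a tree where the number of leading dots decides the nesting level.
function parseDotNotation(text) {
  const root = { name: "root", value: "", data: [], children: [] };
  const stack = [root]; // stack[level] is the currently open element at that level

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "") continue;

    const dots = line.match(/^\.*/)[0].length;
    if (dots === 0) {
      // No dots: plain data (e.g. coordinates) for the most recent element.
      stack[stack.length - 1].data.push(line);
      continue;
    }

    const [name, ...rest] = line.slice(dots).split(/\s+/);
    const node = { name, value: rest.join(" "), data: [], children: [] };

    stack.length = Math.min(dots, stack.length); // close deeper elements
    stack[stack.length - 1].children.push(node);
    stack.push(node);
  }
  return root;
}

const tree = parseDotNotation(sample);
console.log(tree.children.map((node) => node.name).join(", "));
// prints: HODE, PUNKT, SLUTT
```

The resulting tree has the header, the point and the end marker as top-level elements, with the coordinate line stored as data on the point’s “NØ” element.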