Efficient PostgreSQL/PostGIS imports from Python using copy statements

For some reason I’ve had to import fairly large chunks of data into PostgreSQL/PostGIS using Python lately. When I wrote the HOGS app I decided that COPY statements were the way to go, as they are a fast and reliable way of getting large chunks of data into PostgreSQL.

But, as you might know, “COPY moves data between PostgreSQL tables and standard file-system files.” [1]. Ugh, those pesky files. My data isn’t in files and shouldn’t be.

So I started googling and found this gist, which seemed to do the trick: it works with Python, uses psycopg2, and eliminates the need for a file, all while letting me use COPY statements. All good?
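The core idea is that psycopg2’s copy_expert accepts any file-like object, so you can hand it an in-memory buffer instead of a real file. A minimal sketch of the approach (the function and table names here are mine, not from the gist):

```python
import io

def copy_sql(table, columns):
    # the statement copy_expert will execute
    return 'COPY %s (%s) FROM STDIN' % (table, ', '.join(columns))

def build_copy_buffer(lines):
    # join tab-separated lines into an in-memory "file"
    return io.StringIO(''.join(line + '\n' for line in lines))

def copy_records(conn, lines, table, columns):
    # no file on disk needed: copy_expert reads from any file-like object
    with conn.cursor() as cur:
        cur.copy_expert(copy_sql(table, columns), build_copy_buffer(lines))
    conn.commit()
```

This buffers everything in memory, which is fine for moderate amounts of data; the gist goes one step further and wraps the iterator in a streaming file-like object.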

Well, almost. PostGIS is a bit picky: you need to provide the geometries as WKB, but the WKB standard does not specify an SRID. So, if your table is defined with a column of type Geometry(4326), you’ll get an error. The solution? Use EWKB (the E is for Extended). A tool such as pygeos lets you add an SRID to a WKB-encoded geometry like this:

from pygeos import to_wkb, from_wkb, set_srid

def add_srid(wkb, srid=4326):
    # include_srid=True makes to_wkb emit EWKB instead of plain WKB
    return to_wkb(set_srid(from_wkb(wkb), srid), include_srid=True)
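If you are curious about what EWKB actually is: it is plain WKB with the flag 0x20000000 set in the geometry-type word and the SRID inserted as a uint32 right after it. A dependency-free sketch (a hypothetical helper, handling little-endian input only):

```python
import struct

def wkb_to_ewkb(wkb, srid=4326):
    # EWKB = WKB with the SRID flag set in the geometry-type word
    # and the SRID inserted right after it
    assert wkb[0] == 1, 'sketch handles little-endian WKB only'
    (geom_type,) = struct.unpack_from('<I', wkb, 1)
    return (
        wkb[:1]
        + struct.pack('<I', geom_type | 0x20000000)  # set the SRID flag
        + struct.pack('<I', srid)                    # insert the SRID
        + wkb[5:]                                    # rest of the geometry
    )
```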

Another issue worth considering when importing, say, several million geometries, is that you probably want some commits in between. Mucking about a bit with generators, I found a way to split a generator into a generator that yields generators, so that you can iterate over, say, 10 000 elements and then take a break. Something like this:

def split_generator(gen, elements_pr_generator):
    is_running = True

    def generator(gen):
        nonlocal is_running
        c = 0
        while c < elements_pr_generator:
            try:
                n = next(gen)
            except StopIteration:
                # PEP 479: return instead of re-raising StopIteration
                # inside a generator
                is_running = False
                return
            yield n
            c += 1

    while is_running:
        yield generator(gen)
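If you are happy materializing each batch as a list, the standard library gets you the same effect with less machinery, via itertools.islice:

```python
from itertools import islice

def batched(gen, n):
    # repeatedly pull up to n items off the same underlying iterator
    gen = iter(gen)
    while True:
        batch = list(islice(gen, n))
        if not batch:
            return
        yield batch
```

Each yielded list holds at most n records, so you can run one COPY and one commit per batch.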

One last thing you might get bitten by is the fact that the COPY statement expects a file, and a file consists of text. If you are inserting, say, a dictionary into a json(b) column, you might end up with errors. The same goes for None values. To handle this, I wrote a line generator that takes a record from your generator and transforms it into a proper line:

def format_line(record, column_data):
    line_template = '\t'.join(['%s'] * len(column_data))
    data = []
    for column in column_data:
        key = column['key']
        value = None

        # get the value
        if key in record:
            value = record[key]

        # format the value
        if value is None:
            data.append('\\N')  # COPY's text-format NULL marker
        elif 'encoder' in column:
            data.append(column['encoder'](value))  # e.g. json.dumps
        else:
            data.append(value)

    return line_template % tuple(data)

This function takes a definition for each column, pulls the corresponding value from the record, and formats it as a proper line to be fed to the COPY statement.
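To make the expected output concrete: with a hypothetical encode_value helper standing in for the branches above, a record with a text column, a json(b) column and a NULL would be rendered like this:

```python
import json

def encode_value(value, encoder=None):
    # mirror the per-cell branches of format_line
    if value is None:
        return '\\N'  # COPY's text-format NULL marker
    if encoder is not None:
        return encoder(value)
    return str(value)

line = '\t'.join([
    encode_value('Oslo'),
    encode_value({'pop': 700000}, json.dumps),
    encode_value(None),
])
# line is now a single tab-separated row ready for COPY
```

One caveat: real string data also needs backslash, tab and newline escaping before it is safe to feed to COPY’s text format.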

If you combine these elements you end up with a small utility that lets you copy most Python data into PostgreSQL with ease. So, if this is something you need, check out pg_geomcopy on GitHub!
