For some reason I’ve had to import fairly large chunks of data into PostgreSQL/PostGIS using Python lately. When I wrote the HOGS app I decided that COPY statements were the way to go, as this is a fast and reliable way of getting large chunks of data into PostgreSQL.
But, as you might know, “COPY moves data between PostgreSQL tables and standard file-system files.” [1] Ugh, those pesky files. My data isn’t in files and shouldn’t be.
So I started googling and found this gist, which seemed to do the trick: it works with Python, uses psycopg2, and eliminates the need for a file, while still letting me use COPY statements. All good?
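In case that gist ever disappears, the core idea looks roughly like this: wrap a generator of tab-separated lines in a minimal file-like object and hand it to psycopg2’s copy_expert. This is just a sketch of the pattern, not the gist itself, and the IteratorFile name and the example table are my own placeholders:

class IteratorFile:
    """Minimal read-only, file-like wrapper around an iterator of text lines."""

    def __init__(self, lines):
        self._iter = iter(lines)
        self._buffer = ''

    def read(self, size=-1):
        # pull lines from the iterator until we have `size` characters (or run dry)
        while size < 0 or len(self._buffer) < size:
            try:
                self._buffer += next(self._iter)
            except StopIteration:
                break
        if size < 0:
            size = len(self._buffer)
        data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data

# usage sketch (my_table and its columns are made up):
# lines = ('%s\t%s\n' % (pk, name) for pk, name in rows)
# cursor.copy_expert('COPY my_table (id, name) FROM STDIN', IteratorFile(lines))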
Well, almost. PostGIS is a bit picky: you need to provide the geometries as WKB, but the WKB standard does not include an SRID. So, if your table is defined with a column of type Geometry(4326), you’ll get an error. The solution? Use EWKB (the E is for Extended). A tool such as pygeos lets you add an SRID to a WKB-encoded geometry like this:
from pygeos import to_wkb, from_wkb, set_srid

def add_srid(wkb, srid=4326):
    return to_wkb(
        set_srid(from_wkb(wkb), srid),
        hex=True,
        include_srid=True
    )
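To make it concrete, this is roughly how you would use it, with pygeos creating the test geometry (the coordinates are just an arbitrary point):

from pygeos import points, to_wkb

# plain WKB: no SRID information in it
wkb = to_wkb(points(10.75, 59.91))

# EWKB as a hex string with SRID 4326 embedded,
# acceptable for a Geometry(..., 4326) column
ewkb_hex = add_srid(wkb)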
Another issue worth considering when importing, say, several million geometries, is that you probably want some commits along the way. Mucking about a bit with generators, I found a way to split a generator into a generator that yields generators, so that you can iterate over, say, 10 000 elements and then take a break to commit. Something like this:
def split_generator(gen, elements_pr_generator):
    is_running = True

    def generator(gen):
        # yields at most elements_pr_generator items from the source generator
        nonlocal is_running
        c = 0
        while c < elements_pr_generator:
            try:
                yield next(gen)
            except StopIteration:
                # the source is exhausted; re-raising StopIteration inside a
                # generator is a RuntimeError since PEP 479, so just stop
                is_running = False
                return
            c += 1

    # each resumption hands out a fresh generator limited to elements_pr_generator items
    while is_running:
        yield generator(gen)
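Wired together with psycopg2, that could look something like the sketch below, where records, my_table and the connection string are placeholders, and IteratorFile is the file-like wrapper from the gist pattern above:

import psycopg2

# hypothetical wiring: records yields pre-formatted, tab-separated COPY lines
connection = psycopg2.connect('dbname=mydb')

for chunk in split_generator(records, 10000):
    with connection.cursor() as cursor:
        cursor.copy_expert(
            'COPY my_table (id, geom, properties) FROM STDIN',
            IteratorFile(line + '\n' for line in chunk),
        )
    # commit between chunks so a single failure does not roll back everything
    connection.commit()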
One last thing you might get bitten by is the fact that the COPY statement expects a file, and a file consists of text. So if you are inserting, say, a dictionary into a json(b) column, you might end up with errors. The same goes for None values. To handle this, I wrote a line generator that takes a record from your generator and transforms it into a proper line:
def format_line(record, column_data):
    # one tab-separated %s placeholder per column
    line_template = '\t'.join(['%s'] * len(column_data))
    data = []
    for column in column_data:
        key = column['key']
        value = None
        # get the value
        if key in record:
            value = record[key]
        if value is None:
            # \N is the COPY text-format representation of NULL
            data.append('\\N')
        elif 'encoder' in column:
            data.append(column['encoder'](value))
        else:
            data.append(value)
    return line_template % tuple(data)
This function takes a definition for each column, pulls the corresponding value out of each record from the generator, and formats it as a proper line to be fed to the COPY statement.
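As an illustration, a column definition could look something like this, with json.dumps as the encoder for a json(b) column (the column names and the record are made up):

import json

column_data = [
    {'key': 'id'},
    {'key': 'geom', 'encoder': add_srid},          # WKB bytes in, EWKB hex out
    {'key': 'properties', 'encoder': json.dumps},  # dict in, json text out
]

# wkb here is the WKB bytes from the pygeos example above
record = {'id': 1, 'geom': wkb, 'properties': {'name': 'somewhere'}}
line = format_line(record, column_data)
# a missing key or a None value becomes \N, which COPY reads as NULL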
If you combine these elements you end up with a small utility that lets you copy most Python data into PostgreSQL with ease. So, if this is something you need, check out pg_geomcopy on GitHub!