For some reason I’ve had to import fairly large chunks of data into PostgreSQL/PostGIS using Python lately. When I wrote the HOGS app I decided that COPY statements were the way to go, as they are a fast and reliable way of getting large amounts of data into PostgreSQL.
But, as you might know, “COPY moves data between PostgreSQL tables and standard file-system files.” [1]. Ugh, those pesky files. My data isn’t in files and shouldn’t be.
So I started googling and found this gist. That seemed to do the trick: it works with Python, uses psycopg2, and eliminates the need for a file, all while still letting me use COPY statements.
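The core trick is that psycopg2’s copy_expert accepts any readable file-like object, so an in-memory buffer works just as well as a file on disk. A minimal sketch of the idea (not the gist itself; the table name and connection string are made up for illustration):

import io
import psycopg2

conn = psycopg2.connect('dbname=mydb')   # placeholder connection string

# two rows in COPY text format: tab-separated columns, newline-separated rows
buf = io.StringIO('1\tfoo\n2\tbar\n')

with conn.cursor() as cur:
    cur.copy_expert('COPY my_table (id, name) FROM STDIN', buf)
conn.commit()

No files on disk, and still COPY speed. All good?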
Well, almost. PostGIS is a bit picky. You need to provide the geometries as WKB. But the WKB standard does not include an SRID. So, if your table column is defined as, say, Geometry(Point, 4326), you’ll get an error. The solution? Use EWKB (the E is for Extended). A tool such as pygeos lets you add an SRID to a WKB-encoded geometry like this:
from pygeos import to_wkb, from_wkb, set_srid

def add_srid(wkb, srid=4326):
    # decode the WKB, set the SRID and re-encode as hex EWKB
    return to_wkb(
        set_srid(from_wkb(wkb), srid),
        hex=True,
        include_srid=True
    )
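For example (a quick sanity check, with coordinates picked at random):

from pygeos import points, to_wkb

plain_wkb = to_wkb(points(10.75, 59.91))   # ordinary WKB bytes, no SRID
ewkb_hex = add_srid(plain_wkb, srid=4326)  # hex EWKB with the SRID baked in

The resulting hex string is what goes into the COPY data for the geometry column.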
Another issue worth considering when importing, say, several million geometries is that you probably want some commits along the way. Mucking about a bit with generators, I found a way to split a generator into a generator that yields generators, so that you can iterate over, say, 10 000 elements and then take a break to commit. Something like this:
def split_generator(gen, elements_pr_generator):
    # yields sub-generators, each exhausting at most
    # elements_pr_generator items from gen
    is_running = True

    def generator(gen):
        nonlocal is_running
        c = 0
        while c < elements_pr_generator:
            try:
                yield next(gen)
            except StopIteration:
                # re-raising StopIteration inside a generator becomes a
                # RuntimeError in Python 3.7+ (PEP 479), so just stop here
                is_running = False
                return
            c += 1

    while is_running:
        yield generator(gen)
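A toy example to show the chunking (in the real import each chunk would get its own COPY and commit):

for chunk in split_generator(iter(range(10)), 3):
    print(list(chunk))
# [0, 1, 2]
# [3, 4, 5]
# [6, 7, 8]
# [9]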
One last thing you might get bitten by is the fact that the COPY statement expects a file, and a file consists of text. So, if you are inserting, say, a dictionary into a json(b) column, you might end up with errors. Same for None values, which have to become NULLs. To handle this, I wrote a line formatter that takes a record from your generator and transforms it into a proper line:
def format_line(record, column_data):
    line_template = '\t'.join(['%s'] * len(column_data))
    data = []
    for column in column_data:
        key = column['key']
        value = None
        # get the value
        if key in record:
            value = record[key]
        if value is None:
            data.append('\\N')  # the NULL marker in the COPY text format
        elif 'encoder' in column:
            data.append(column['encoder'](value))
        else:
            data.append(value)
    return line_template % tuple(data)
This function takes a definition for each column, pulls the corresponding value out of the record, runs it through an encoder if one is defined, and formats the result as a line that can be fed to the COPY statement.
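Putting it all together, a rough sketch of the import loop could look something like this (the table, column definitions, connection string and records() generator are all placeholders; pg_geomcopy wraps this up more neatly):

import io
import json
import psycopg2

# column definitions for format_line: the geometry is WKB that gets encoded to
# hex EWKB, the properties dict is serialised to JSON text for a jsonb column
columns = [
    {'key': 'geom', 'encoder': add_srid},
    {'key': 'properties', 'encoder': json.dumps},
]

conn = psycopg2.connect('dbname=mydb')   # placeholder connection string

with conn.cursor() as cur:
    for chunk in split_generator(records(), 10000):
        payload = ''.join(format_line(record, columns) + '\n' for record in chunk)
        cur.copy_expert('COPY my_table (geom, properties) FROM STDIN', io.StringIO(payload))
        conn.commit()   # commit between chunks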
Combined like this, these elements make up a small utility that lets you copy most Python data into PostgreSQL with ease. So, if this is something you need, check out pg_geomcopy on GitHub!