Optimal way to do SELECT without saturating the heap

What would be a good practice for requesting large amounts of data from CrateDB? I have a use case where I need to fetch all the data available in a table, and I wonder if there is a way to request all of it without saturating the heap.

What client are you using?

Many PostgreSQL client libraries support a fetch size (e.g. asyncpg or JDBC).
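For illustration, this is roughly what a fetch size looks like with asyncpg (a sketch only; host, port, user, and table name are placeholders, and it assumes the server supports the cursor machinery asyncpg uses):

import asyncio
import asyncpg

async def fetch_all():
    conn = await asyncpg.connect(host="localhost", port=5432, user="crate")
    try:
        # Cursors in asyncpg only work inside a transaction
        async with conn.transaction():
            # prefetch = number of rows pulled per round trip, so the whole
            # result set never has to sit in memory at once
            async for record in conn.cursor("SELECT * FROM my_table", prefetch=1000):
                print(record)  # replace with your own row handling
    finally:
        await conn.close()

asyncio.run(fetch_all())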

Do you just want to fetch a complete table, or a result set of a query?

I need to fetch a complete table. I took a look at asyncpg, but I’m getting this error (the same error also happened when trying with SQLAlchemy and psycopg2):

asyncpg.exceptions.InternalServerError: line 1:1: mismatched input 'WITH' expecting {'SELECT', 'DEALLOCATE', 'CREATE', 'ALTER', 'KILL', 'BEGIN', 'START', 'COMMIT', 'ANALYZE', 'DISCARD', 'EXPLAIN', 'SHOW', 'OPTIMIZE', 'REFRESH', 'RESTORE', 'DROP', 'INSERT', 'VALUES', 'DELETE', 'UPDATE', 'SET', 'RESET', 'COPY', 'GRANT', 'DENY', 'REVOKE'}

or

line 1:1: mismatched input 'DECLARE' expecting {'SELECT', 'DEALLOCATE', 'CREATE', 'ALTER', 'KILL', 'BEGIN', 'START', 'COMMIT', 'ANALYZE', 'DISCARD', 'EXPLAIN', 'SHOW', 'OPTIMIZE', 'REFRESH', 'RESTORE', 'DROP', 'INSERT', 'VALUES', 'DELETE', 'UPDATE', 'SET', 'RESET', 'COPY', 'GRANT', 'DENY', 'REVOKE'} which is related to https://github.com/crate/crate/issues/10212

While trying to fetch all the data into a cursor with psycopg2, it throws:

BytesStreamOutput cannot hold more than 2GB of data

—Update—

Found a hacky way to do it through pandas: https://stackoverflow.com/questions/18107953/how-to-create-a-large-pandas-dataframe-from-an-sql-query-without-running-out-of

import pandas as pd

# conn: an open DB-API connection to CrateDB (e.g. from psycopg2)
chunk_size = 100_000
offset = 0
dfs = []
while True:
    sql = "SELECT * FROM {table} LIMIT %d OFFSET %d" % (chunk_size, offset)
    dfs.append(pd.read_sql(sql, conn))
    offset += chunk_size
    if len(dfs[-1]) < chunk_size:
        break
full_df = pd.concat(dfs)

Another approach could be to dump the table with crash, write it to a file, and read it back later. Not sure if there are alternatives to these two.
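A sketch of that second option, assuming crash's --command and --format options and placeholder host, table, and file names; reading the dump back in chunks keeps it from having to fit in memory all at once:

# Dump the table to a CSV file first, e.g.:
#   crash --hosts localhost:4200 --command "SELECT * FROM my_table" --format csv > my_table.csv

import pandas as pd

# Read the dump back in manageable chunks instead of all at once
for chunk in pd.read_csv("my_table.csv", chunksize=100_000):
    print(len(chunk))  # replace with your own per-chunk processing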


Another approach, and I think the best way to retrieve all the data in a table, would be keyset pagination.

For the first iteration do:

SELECT {columns} FROM {table} ORDER BY {time}, {id} LIMIT {size};

And for subsequent queries do:

SELECT {columns} FROM {table} WHERE {time} > {previous last item's time} AND {id} > {previous last item's id} ORDER BY {time}, {id} LIMIT {size};

@cande1gut Is there a specific reason why you order by both time and id?

Specifically, when using the WHERE clause {time} > {previous last item's time}, I think it’s worth considering what happens when you have records with the same time.

Example:

text | time
-----+-----
a    |    1
b    |    2
c    |    3
d    |    3
e    |    4
f    |    5

SELECT text FROM table ORDER BY time LIMIT 3; returns a, b, c
Now, using the time 3 of the last row c for the subsequent query:
SELECT text FROM table WHERE time > 3 ORDER BY time LIMIT 3; returns e, f
Record d is never returned.


I totally skipped that, thanks for pointing it out. You would need to add an = to the {time} condition to include the skipped row:

SELECT {columns} FROM {table} WHERE {time} >= {previous last item's time} AND {id} > {previous last item's id} ORDER BY {time}, {id} LIMIT {size};

Considering that the time can repeat and you have another column with a unique value, that other column can be used to help with the ordering (which is why I am ordering by both time and id).


Given the id column is unique, it should be sufficient to order only by id and use the WHERE clause {id} > {previous last item's id}.
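A minimal sketch of that id-only keyset pagination with psycopg2 (connection parameters, table, and column names are placeholders):

import psycopg2

conn = psycopg2.connect(host="localhost", port=5432, user="crate", dbname="doc")
page_size = 10_000
last_id = None

with conn.cursor() as cur:
    while True:
        if last_id is None:
            # First page: no keyset yet
            cur.execute("SELECT id, text FROM my_table ORDER BY id LIMIT %s", (page_size,))
        else:
            # Subsequent pages: continue after the last seen id
            cur.execute(
                "SELECT id, text FROM my_table WHERE id > %s ORDER BY id LIMIT %s",
                (last_id, page_size),
            )
        rows = cur.fetchall()
        if not rows:
            break
        # ... process the current page of rows here ...
        last_id = rows[-1][0]  # keyset value for the next page

conn.close()

Each query only ever materializes one page of rows, so heap usage stays bounded by page_size regardless of the table size.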
