Cratedb update a column with 3 million records

Hello community. so I am running an update query.

UPDATE "transactions_v2" set "itemPrices" = NULL;

My table has about 3 million records in it. When running this query I get an error.

Error!

[object Object]

I can’t see the details of the error either. Is there a better way to run an update query with millions of records.

Could you maybe share the table information and CrateDB version.

Are you using the Admin UI, and if so did you enable the “Show error trace” option?

Crate version is: Version:4.6.7.

I am not getting anything logged for the error trace.

Thank you!

This seems to be a bug.
Could you try upgrading to a newer CrateDB version (i.e. 4.8.4 or 5.0.1) and see if this persists.

Otherwise since we are talking only about 3 million records copying the table without the column might be an option

SHOW CREATE TABLE transactions_v2;
CREATE TABLE transactions_v3 ...
INSERT INTO transactions_v3 (<columns,...>) SELECT <columns,...> FROM transactions_v2;
ALTER CLUSTER SWAP TABLE transactions_v3 TO transactions_v2

Gotchya, I am running into some issues when running the same update query.

Error!

SQLParseException[Couldn't create execution plan from logical plan because of: the shard 1 of table [scp_service_transaction.transactions_v2/YQFVfMZHRXinZwfALmruKQ] is not available: Fetch["itemPrices"] └ Limit[100::bigint;0] └ Collect[scp_service_transaction.transactions_v2 | [_fetchid] | true]]

I am unable to query my table and I get the error above. The health on my table is also labeled “red”. Any way I can fix my table so I can query again? I ran the update query multiple times.

Fixed by running

ALTER CLUSTER REROUTE RETRY FAILED

did this fix the original problem?

No it did not fix the original problem where the update query was returning an error. After doing some more investigation I was running queries like

UPDATE "scp_service_transaction"."transactions_v2" set "itemPrices" = NULL WHERE "itemPrices" IS NOT NULL;

That seemed to return “OK” for me.

My question is how would I run an update query in batches? So I want to run an update query where we take the first 500 records and update, then take the next 500 and update all the way until reaching the total number of records in the database. It seems like thats the only way to update the whole table without the table reaching bad health.

Hi,
In other systems it is important to batch operations like this and you would find implementations of the SQL UPDATE command that would take a LIMIT clause or equivalent or let you JOIN the table to be updated with a lookup table. These options are not currently available in CrateDB, you could always use your client process to look up the PK values of the records that need updating first and then run the UPDATE against 500 of those records at a time, but the reality is that this should not be needed in CrateDB, there is no locking, and atomicity is at the document level, so the command should work just fine for millions of records. Are you seeing any issues now that you managed to get pass the error message you had initially?
Another option to consider is to copy the rest of the data into a new table and then swap the tables.

So when I run the update command against millions of records the best I was able to get was update 2.2 million records. What do you mean by PK values? How do I look up the PK values? For me copying data to another table is not really an option because in Production everything is pointing to that one table and used by many microservices. I am still seeing the same issue running the update query where I get.

Error!

[Object Object]

I have to run the update query about two or three times to fully update all the columns in the database.

The two update queries I am working with are

UPDATE "scp_service_transaction"."transactions_v2" set "itemPrices" = NULL WHERE "itemPrices" is Not Null;
``` and

UPDATE “scp_service_transaction”.“transactions_v2”
SET “itemPrices” = extract_item_prices(“serializedTransaction”);

I defined 'extract_item_prices' as a javscript function in cratedb.

For the second query I wrote a migration script in order to update the DB but the best result I got was updating 2.3 million records out of 2.9 million.

Apologies, I misunderstood that after the REROUTE RETRY FAILED you were able to successfully run to completion the UPDATE with WHERE "itemPrices" IS NOT NULL

Could you check if there are additional error details in /var/log/crate/crate.log ?

By PK I mean primary key values, if you did not define a primary key in this table there will still be an _id column (see System columns — CrateDB: Reference )
you could for instance run

SELECT _id
FROM "scp_service_transaction"."transactions_v2"
WHERE "itemPrices" is Not Null
LIMIT 500

and then

UPDATE "scp_service_transaction"."transactions_v2"
SET "itemPrices" = NULL 
WHERE _id IN (...)

until the first query does not return any results anymore

So I came up with a solution

UPDATE "scp_service_transaction"."transactions_v2"
SET "itemPrices"= extract_item_prices("serializedTransaction")
WHERE "transactionId" in (
    SELECT "transactionId" from "scp_service_transaction"."transactions_v2" where "itemPrices" IS NULL AND get_array_length("itemPrices") = -1 limit 500
  );

I wrote a migration script with a while loop where once the update query returns a rowcount of zero we stop updating. Thanks guys!

1 Like