Copying table results in corrupt data

Hi

Due to a bug in Crate (see post: Adding column generates error) we are unable to add a column to our production database.

While waiting for the pull request to be accepted and included in the next release, I wanted to do some testing with renaming tables. (Creating a new table which includes the new columns and then switching the names.). See what the impact would be and if it would result in downtime.

All tests were done on a development cluster with the same structure as our production database but less data

As a fist step, I created a copy of a table with the same structure. Next, I wanted to copy the data of the original table into the test table.
The table contains 29 million records as is 7GB in size.

To copy the data I used the naive

insert into test.table (select * from test.table)

This query took a while to complete so I left it overnight.
When I returned in the morning, the query seemed to have completed.
But when trying to query the database (any table) everything was unresponsive and the admin panel reported /usr/share/crate/lib/site/index.html: Too many open files

I Rebooted al the nodes one by one in the 3-node cluster after which the admin panel reported that a lot of shards were missing!
The original table from which the data was copied, was completely missing in the shards panel as no shards were available.
After rebooting a second time, the shards were still not up.
by running alter cluster reroute retry failed 3 of the 6 shards came online.

select * from sys.allocations reports

  • cannot allocate because all found copies of the shard are either stale or corrupt
  • or cannot allocate because allocation is not permitted to any of the nodes
    for the 3 missing shards for the primary and secondary shards respectively.

Querying data form the table gives
org.jkiss.dbeaver.model.sql.DBSQLException: SQL Error [XX000]: ERROR: Couldn't create execution plan from logical plan because of: the shard 1 of table [data.meterreading/EXqQQPdfSsm_CV5qD0ylAg] is not available:

Can somebody give some insight into why this happened?
How can I get the table back online? (even with data loss?)
How can this be prevented?
Is it a bad idea to copy data with a simple insert into table2 (select * from table 1)
What is an alternative to efficiently copy data between tables?

Hi @Lukas,

can you share the CREATE TABLE statement for the original and test table with us (SHOW CREATE TABLE)?

Did you create a snapshot (backup) of your table? Snapshots — CrateDB: Reference

If yes, you should be able to restore the data from your backup. If no, I would recommend to setup backups and create a snapshot before major operations on your cluster.

Is it a bad idea to copy data with a simple insert into table2 (select * from table 1)

Typically I would recommend to explicitly list the columns you want to copy for improved control what does happen.

insert into table2 (select * from table 1) will work as well but does have some potential pitfalls and only works reliable if schema matches exactly.

Can somebody give some insight into why this happened?

I cannot give a definitive answer. My guess would be that you run into the “Too many open files” issue while the data was copied. Due to this “too many open files” issue CrateDB was not able to open/create new files for new shards and that’s how your cluster ended up in this state.

There is another topic on “Too many open files” in our community which may give you some pointers in debugging this issue: What to do to fix "Too many open files" on a 3-node cluster with almost no data?

Hi Jayeff

Thanks for the reply!

I did not know about the Crate Snapshots. The nodes as a whole were snapshotted and restored. The cluster booted successfully albeit with the data missing since snapshot. As it is just a test cluster, this is not a real problem.
But I will look into the crate snapshots for future table changes.

I did specify all column names in the select and insert statement, but I did not add it in my post for brevity.

The thread you mentioned about “To many open files” does not provide any real answers just some ways to check if to many files are open, not how to prevent the issue.

The CREATE TABLE code for the source table is. The test table is gone due to the snapshot restore, but it was created from the DDL from the source table “meterreading” without changing any of the sharding config.

CREATE TABLE IF NOT EXISTS "data"."meterreading" (
   "servicepointassetenergytixcode" TEXT,
   "datetime" TIMESTAMP WITH TIME ZONE,
   "recalculatedconsumption1" DOUBLE PRECISION,
   "recalculatedconsumption2" DOUBLE PRECISION,
   "recalculatedinjection1" DOUBLE PRECISION,
   "recalculatedinjection2" DOUBLE PRECISION,
   "recalculatedactualconsumption" DOUBLE PRECISION,
   "recalculatedactualinjection" DOUBLE PRECISION,
   "recalculatedtotalapparentpower" DOUBLE PRECISION,
   "recalculatedtotalreactivepower" DOUBLE PRECISION,
   "recalculatedactivepowerphase1" DOUBLE PRECISION,
   "recalculatedactivepowerphase2" DOUBLE PRECISION,
   "recalculatedactivepowerphase3" DOUBLE PRECISION,
   "recalculatedvoltagel1l2" DOUBLE PRECISION,
   "recalculatedvoltagel2l3" DOUBLE PRECISION,
   "recalculatedvoltagel3l1" DOUBLE PRECISION,
   "recalculatedvoltagellavg" DOUBLE PRECISION,
   "recalculatedvoltagel1n" DOUBLE PRECISION,
   "recalculatedvoltagel2n" DOUBLE PRECISION,
   "recalculatedvoltagel3n" DOUBLE PRECISION,
   "recalculatedvoltagelnavg" DOUBLE PRECISION,
   "recalculatedcurrentphase1" DOUBLE PRECISION,
   "recalculatedcurrentphase2" DOUBLE PRECISION,
   "recalculatedcurrentphase3" DOUBLE PRECISION,
   "recalculatedcurrentavg" DOUBLE PRECISION,
   "recalculatedtotalpowerfactor" DOUBLE PRECISION,
   "recalculatedfrequency" DOUBLE PRECISION,
   "recalculateddatagramdatetime" TIMESTAMP WITH TIME ZONE,
   "recalculatedproduction1" DOUBLE PRECISION,
   "recalculatedproduction2" DOUBLE PRECISION,
   "recalculatedactualproduction" DOUBLE PRECISION,
   "recalculatedpulsequantity2" DOUBLE PRECISION,
   "recalculatedinductivepowerconsumption" DOUBLE PRECISION,
   "recalculatedinductivepowerinjection" DOUBLE PRECISION,
   "recalculatedcapacitivepowerconsumption" DOUBLE PRECISION,
   "recalculatedcapacitivepowerinjection" DOUBLE PRECISION,
   "recalculatedinductivepowerconsumptiontest" DOUBLE PRECISION,
   PRIMARY KEY ("servicepointassetenergytixcode", "datetime")
)
CLUSTERED INTO 6 SHARDS
WITH (
   "allocation.max_retries" = 5,
   "blocks.metadata" = false,
   "blocks.read" = false,
   "blocks.read_only" = false,
   "blocks.read_only_allow_delete" = false,
   "blocks.write" = false,
   codec = 'default',
   column_policy = 'strict',
   "mapping.total_fields.limit" = 1000,
   max_ngram_diff = 1,
   max_shingle_diff = 3,
   number_of_replicas = '0-1',
   "routing.allocation.enable" = 'all',
   "routing.allocation.total_shards_per_node" = -1,
   "store.type" = 'fs',
   "translog.durability" = 'REQUEST',
   "translog.flush_threshold_size" = 536870912,
   "translog.sync_interval" = 5000,
   "unassigned.node_left.delayed_timeout" = 60000,
   "write.wait_for_active_shards" = '1'
)

Hi @Lukas,

thanks for your reply. Nothing jumps out from your CREATE TABLE statement so this should be fine.

It does not solve the issue but potentially gives some hints about what could be going on.

So for example I would expect that it helps if you do not connect via browser and leave it open over night. Instead you could use crash or psql to connect to your cluster and execute your query.

If you have some time to investigate further and do find some clues on this we are happy to hear about it. We currently do not know what could be the culprit here. Our own attempts to reproduce this behaviour did not work