Translogs filling up disk space

Hi,
The main use case we use Crate.io for is a cache, where we have 2-3 tables that are being updated frequently.
Every minute we fetch data from a service (ca. 500 records), do some calculations, and then UPDATE records in our Crate.io cluster.
Meaning, the data size itself is relatively small, ca. 3000 records in that table and ca. 9000 records in another, and the number of records will stay that small.

Still, after running the software for a month, maybe two, the free disk space starts to decrease until the disk eventually fills up to the point where Crate.io goes into read-only mode.

Looking at the disk, we can see that the translog files are being written but not removed, so the “translog” directory gets really large over time.
So far, I’ve only found a way to limit the size of the translog file itself (via translog.flush_threshold_size), but I can’t find a way to limit the TOTAL size of all translogs on a particular node, or across the cluster.
Also, OPTIMIZE with flush does not remove those translogs, even though they should get cleared according to your documentation about durability and storage.
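
For reference, this is roughly how we set that per-shard limit today (the table name and size are just examples from our setup, not a recommendation):

ALTER TABLE my_cache_table SET ("translog.flush_threshold_size" = '256mb');

But as far as I can tell this only bounds a single shard’s translog, not the total size on a node.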

So, right now it seems like, even though I’m only doing updates to my tables, the translog will grow indefinitely. Or is there a way to avoid this and to limit the total size of all translogs?

Thanks

Hi Senchi,

Nice to meet you! You seem to be a long-term user of CrateDB. I would love to hear your feedback, especially sore points, and also what drove you to CrateDB for that use case.

What version are you running? This may be a known issue regarding checkpoints going out of sync.

Regarding your use case, this is what I understand:

  • Total of 3 tables, pre-populated with some data
  • Updates to the data at ca. 500 records per minute

Are you using a cluster? How many nodes?

Best regards

Hi @miguel.arregui, nice to meet you too! :slight_smile:

Yeah, I generally enjoy working with Crate.io, even if my system is far from demanding enough to justify a cluster. At one point we did have a feature that required some time-series tracking, but for various reasons we had to simplify that feature. We might pick it up again in the future, though.
A few years back I attended one of your workshops in Dornbirn (Snowsprint 2015, I think), and besides having a blast with the team :D, I’ve kind of grown fond of the idea of having a cheap but still scalable alternative, also for simpler projects that you’d usually build on top of MySQL or Postgres. Having an SQL interface and the PostgreSQL wire protocol of course helps a lot too!

My current setup is three EC2 instances in AWS working as a cluster, which in total is way cheaper than an RDS service from AWS, especially if you don’t have “too much” traffic to sync between the AZs. Each instance has only 4 GB of local storage.
Currently we’re running version 3.3.5 (CE build). I tried to switch to 4.x but failed on my first attempt with the new discovery settings. Meanwhile I’ve seen there have been some changes to the wire protocol to work better with JSON, which require code changes on our side, so I’ve postponed that for “better days” :).
We have a few more tables (i.e. config and similar), but the three I’ve mentioned are the ones with the most transactions, and the ones whose translog files we see filling up the data directory.
And yes, we have a scheduler that runs every minute, which fetches ca. 500 entries from another API and then updates these three tables.
As I’ve mentioned, we’re trying to work with as few resources as possible, and that’s why we have only 4 GB of local storage per node. But since our data is not that big, these 4 GB should really suffice, shouldn’t they?

Thanks

Hi Senchi,

I am happy to hear that! As far back as 2015, so you are a long-term runner. I resonate with what you say about Postgres/MySQL; I do everything on CrateDB these days.

Version 3.3.5 is a year old. In that year Crate has changed the way it works, added core engineers, and been busy improving the product, including the translog issue, as part of maintaining the third-party libraries it relies upon:

https://crate.io/docs/crate/reference/en/latest/appendices/release-notes/index.html

If you send the crate.yml file over I can have a look and advise on what changes to make.

Regarding the SQL, yeah! We want to be as compatible as possible with Postgres at the wire level, and we add syntax and general improvements continuously. Are we talking heaps of code changes to move to 4.2.1? What language?

When I read the use case, I observe:

  • every update to the fixed-size table results in:
    1. creating a record of the deletion of the existing doc
    2. creating a new doc with the new values, which is added to the translog and the in-memory buffer
    3. the docs getting flushed out to segments on disk
    4. at some point, a Lucene merge, whereby segments are merged to ditch deletions and reduce the number of segments

Do you use snapshots? Are they stored on the same 4 GB volumes?

CrateDB works best for use cases where the data does not mutate. A plain insert is just one operation; there are no deletes in the way, which also affects when merges happen, yielding better performance. I personally don’t know exactly how I would model your use case. Maybe have a table per changing attribute and insert updates with a timestamp, partitioned by day, where the last entry by timestamp is the value you want, and then whack partitions every now and then to save space; see the sketch below.
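
Roughly along these lines (table and column names are made up, purely to illustrate the idea):

CREATE TABLE attr_history (
    entity_id TEXT,
    value DOUBLE PRECISION,
    ts TIMESTAMP WITH TIME ZONE,
    day TIMESTAMP WITH TIME ZONE GENERATED ALWAYS AS date_trunc('day', ts)
) PARTITIONED BY (day);

-- insert-only writes instead of UPDATEs
INSERT INTO attr_history (entity_id, value, ts) VALUES ('sensor-1', 42.0, now());

-- read the latest value per entity
SELECT h.entity_id, h.value, h.ts
FROM attr_history h
JOIN (
    SELECT entity_id, MAX(ts) AS max_ts
    FROM attr_history
    GROUP BY entity_id
) latest ON latest.entity_id = h.entity_id AND latest.max_ts = h.ts;

-- drop old data: deleting by the partition column removes whole partitions cheaply
DELETE FROM attr_history WHERE day < '2020-08-01';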

The translog issue will require an upgrade. What a cool opportunity to make decisions :slight_smile:

Best

Hi @miguel.arregui,

I’m not sure I follow completely, but are you saying that translog files only get cleared if I do lots of inserts, and that in my case, where I’m doing almost only updates, this becomes a problem and the translogs will not get cleared?
And are you also saying that in the 4.1.x versions this is not an issue anymore and the translogs will get cleared as they should?
If that is so, then upgrading to the most recent version would be a priority.

The thing is, I remember having disk usage issues before, even on earlier versions. But our development was much more active back then and we were rebuilding our cluster more often, so these translog issues were never really discovered (resetting the cluster also cleared all translogs).

Thanks

PS @miguel.arregui, my crate.yml is really nothing special… this is how it looks right now:

cluster.name: "Production-CrateDb"
network.host: 0.0.0.0
gateway.expected_nodes: 3
gateway.recover_after_nodes: 2
discovery.zen.minimum_master_nodes: 2
discovery.zen.hosts_provider: ec2
discovery.ec2.access_key: "my_access_key"
discovery.ec2.secret_key: "my_secret_key"
discovery.ec2.groups: "sg-xyz"
discovery.ec2.availability_zones:
  - eu-central-1a
  - eu-central-1b
  - eu-central-1c

I’m starting Crate on Ubuntu 18 as a service via /etc/systemd/system/crate.service:

[Unit]
Description=Crate.IO
After=syslog.target

[Service]
WorkingDirectory=/home/bamboo/crate-ce-3.3.5-df03f2e
User=bamboo
Environment="CRATE_HEAP_SIZE=2G"
LimitNOFILE=infinity
LimitMEMLOCK=infinity
LimitNPROC=4096
LimitAS=infinity
ExecStart= /home/bamboo/crate-ce-3.3.5-df03f2e/bin/crate
SuccessExitStatus=143

[Install]
WantedBy=multi-user.target

Also, you asked about snapshots: no, we’re not using snapshots. Instead we keep JSON backup files via COPY TO/FROM (a rough sketch of the statements follows the list below).
So our deployment process looks something like this:

  1. create backup json files and upload to s3 (COPY TO)
  2. stop current API
  3. drop all tables/documents
  4. create new schema
  5. import json backups from s3 (COPY FROM)
  6. deploy new API
  7. e2e testing
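
The COPY statements look roughly like this (bucket, path and table names are placeholders, not our real ones):

-- step 1: export each table as gzipped JSON to S3
COPY my_table TO DIRECTORY 's3://access_key:secret_key@my-backup-bucket/my_table/' WITH (compression = 'gzip');

-- step 5: re-import after the new schema has been created
COPY my_table FROM 's3://access_key:secret_key@my-backup-bucket/my_table/*.json.gz' WITH (compression = 'gzip');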

Thanks

We’ve seen issues in 3.3 where the checkpoints between primary and replica shard copies would get out of sync. If that happens, the translog cannot be trimmed because CrateDB wants to keep it for potential recoveries.

One workaround is to temporarily decrease the number of replicas to 0, then back up to 1 or whatever it was before.
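
For example, roughly (with tbl standing in for the affected table):

-- drop replicas so the retained translog can be released
ALTER TABLE tbl SET (number_of_replicas = 0);
-- then restore the previous replica count
ALTER TABLE tbl SET (number_of_replicas = 1);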

If I remember correctly, a way to mitigate the issue is to set write.wait_for_active_shards to 1 on all tables: ALTER TABLE tbl SET ("write.wait_for_active_shards" = '1')

For larger data sets you’d probably want to switch to using the snapshot feature, as it is faster and less resource intensive, especially since it supports incremental backups, so it doesn’t need to export all the data again every time.

For small data sets COPY TO / FROM of course also works okay.
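
If you go the snapshot route, the workflow looks roughly like this (repository and snapshot names are just examples):

-- one-time setup of an S3-backed repository
CREATE REPOSITORY backups TYPE s3 WITH (bucket = 'my-backup-bucket', access_key = '...', secret_key = '...');

-- take a snapshot of all tables
CREATE SNAPSHOT backups.before_deploy ALL WITH (wait_for_completion = true);

-- restore it if needed
RESTORE SNAPSHOT backups.before_deploy ALL WITH (wait_for_completion = true);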

@mfussenegger thanks for the tips!
We’ve prioritized the upgrade to v4.2.1, and then we’ll see how things behave and whether our monitoring throws some alerts again. Once we gather more data (which could take a while), I’ll get back to you.

In your opinion, for our scenario where we have more updates than inserts, is this wait_for_active_shards trick still recommended with v4.2.1?

If the translogs keep accumulating, I’ll try the trick with decreasing the number of replicas to 0. This can be done at runtime without a node/cluster restart, right?
If that works, it could probably also be included in our monitoring as a “cleanup task”.

Thanks

Yes, it can be done at runtime.

In your opinion, for our scenario where we have more updates than inserts, is this wait_for_active_shards trick still recommended with v4.2.1?

Depends. The default value for wait_for_active_shards changed to 1 in 4.1.0. Any table that was created prior to that should be changed using ALTER TABLE. Newer tables don’t have to be changed.

Just FYI, the problem with filling up disk space still seems to persist, as the following metrics show:


I haven’t had time yet to play with the config as suggested by @mfussenegger, so this is “as is”.
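
When I do get to it, this is the kind of query I plan to use to watch translog growth per shard (assuming sys.shards exposes translog_stats in our version):

SELECT schema_name, table_name, id AS shard_id, "primary",
       translog_stats['size'] AS translog_size_bytes,
       translog_stats['uncommitted_size'] AS uncommitted_bytes
FROM sys.shards
ORDER BY translog_stats['size'] DESC
LIMIT 20;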