Missing records during ingestion via telegraf

Hello folks,

We set up a lab system where we send syslog messages to a Kafka topic and use the Telegraf client (Use CrateDB With Telegraf, an Agent for Collecting & Reporting Metrics) to ingest these flattened JSON messages into CrateDB v5. It works.
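Roughly, the Telegraf pipeline looks like this; the broker address, topic name, and connection string below are placeholders rather than our exact values:

[[inputs.kafka_consumer]]
  brokers = ["kafka:9092"]            # placeholder broker address
  topics = ["syslog"]                 # placeholder topic name
  consumer_group = "telegraf"
  data_format = "json"                # the syslog messages arrive as flattened JSON

[[outputs.cratedb]]
  # pgx-style connection string pointing at the CrateDB PostgreSQL port (placeholder)
  url = "postgres://crate@crate-loadbalancer:5432/doc?sslmode=disable"
  table = "metrics"
  table_create = true                 # let Telegraf create the table (what we started with)
  timeout = "5s"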

We set up a 3-node CrateDB cluster using this blog post:

Each pod is limited to 1 GB of memory and the CrateDB heap size is set to 512 MB. A load balancer distributes the incoming traffic across the nodes. Everything else is set up with the defaults from that blog post.

To test this setup, we sent about 1000 syslog messages with a 1 ms inter-arrival time. After stopping the traffic, we only saw 20 messages ingested into the CrateDB table. We checked the Telegraf ingestion logs and it seems to keep up with the traffic. We checked sys.jobs_log on CrateDB and saw the bulk INSERT attempts by Telegraf; there were no errors recorded in that table. The graphs in the CrateDB web UI don't suggest the cluster is anywhere near its limits during these ingestion attempts; utilization sits at only about 10-20%. Pod metrics don't show any CPU spikes during ingestion either.
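For anyone wanting to reproduce that check, a query along these lines filters sys.jobs_log for statements that failed:

-- Show the most recent failed statements, if any.
SELECT started, stmt, error
FROM sys.jobs_log
WHERE error IS NOT NULL
ORDER BY started DESC
LIMIT 50;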

Is there anything we can tune on CrateDB to improve the ingestion rate? Obviously, 512 MB is not a reasonable heap size for production, but we would like to know how much we can get out of this cluster.

Thanks,

Anything in the Telegraf logs?
Have you checked the CrateDB logs for any error or warning messages?

I didn’t see anything in the telegraf logs. It is running in debug mode and this is what I see:

2022-09-02T14:17:56Z D! [outputs.cratedb] Wrote batch of 68 metrics in 57.950116ms
2022-09-02T14:17:56Z D! [outputs.cratedb] Buffer fullness: 0 / 10000 metrics
2022-09-02T14:18:06Z D! [outputs.cratedb] Wrote batch of 28 metrics in 100.21573ms
2022-09-02T14:18:06Z D! [outputs.cratedb] Buffer fullness: 0 / 10000 metrics
2022-09-02T14:18:16Z D! [outputs.cratedb] Wrote batch of 9 metrics in 73.914379ms
2022-09-02T14:18:16Z D! [outputs.cratedb] Buffer fullness: 0 / 10000 metrics
2022-09-02T14:18:26Z D! [outputs.cratedb] Wrote batch of 54 metrics in 64.749813ms
2022-09-02T14:18:26Z D! [outputs.cratedb] Buffer fullness: 0 / 10000 metrics
2022-09-02T14:18:36Z D! [outputs.cratedb] Wrote batch of 20 metrics in 71.592292ms
2022-09-02T14:18:36Z D! [outputs.cratedb] Buffer fullness: 0 / 10000 metrics

I just took this from Telegraf while the syslog generator was not running; Kafka was only receiving some syslog messages from our firewall.

Here is the CrateDB log output:

I think we are still missing some records under this volume of load.

Can you recreate the metrics table

CREATE TABLE IF NOT EXISTS <table_name> (
	"hash_id" LONG INDEX OFF,
	"timestamp" TIMESTAMP,
	"name" STRING,
	"tags" OBJECT(DYNAMIC),
	"fields" OBJECT(DYNAMIC),
	"day" TIMESTAMP GENERATED ALWAYS AS date_trunc('day', "timestamp"),
	PRIMARY KEY ("timestamp","day")
) PARTITIONED BY("day");

without the hash_id as part of the primary key?


The CrateDB Telegraf plugin calculates a hash from the name and all provided tags.

Done. I sent 209 syslog messages with a 10 ms inter-arrival time and am not seeing much of a difference.

Should I not let Telegraf create the table? I ask because I am still seeing the same hash issue:

Sorry, my bad … of course you would need to remove the primary key entirely, i.e.

CREATE TABLE IF NOT EXISTS <table_name> (
	"hash_id" LONG INDEX OFF,
	"timestamp" TIMESTAMP,
	"name" STRING,
	"tags" OBJECT(DYNAMIC),
	"fields" OBJECT(DYNAMIC),
	"day" TIMESTAMP GENERATED ALWAYS AS date_trunc('day', "timestamp")
) PARTITIONED BY("day");

In the Influx/Telegraf data model, messages with the same name, timestamp, and tags are counted as a single message, even if the field values differ.
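If you want to see how much of your traffic collapses onto the same key, a query along these lines against the new table (column names as in the CREATE statement above) should show it; it ignores tags, which are empty in your setup anyway:

-- Count rows that share the same name and timestamp.
-- With the original primary key each of these groups would have been
-- stored as a single row, the rest silently dropped as duplicates.
SELECT "name", "timestamp", count(*) AS rows_per_key
FROM <table_name>
GROUP BY "name", "timestamp"
HAVING count(*) > 1
ORDER BY rows_per_key DESC
LIMIT 20;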

Before creating that last table, do I need to do anything else on CrateDB besides a “DROP TABLE”?

Before creating that last table, do I need to do anything else on CrateDB besides a “DROP TABLE”?

You might need to stop Telegraf first so it does not auto-create the table again, but otherwise no.
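If you would rather keep Telegraf running while you swap the table, you can also tell the CrateDB output plugin not to create the table itself, e.g.:

[[outputs.cratedb]]
  url = "postgres://crate@crate-loadbalancer:5432/doc?sslmode=disable"   # placeholder connection string
  table = "metrics"
  # don't auto-create the table; use the manually created one instead
  table_create = false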

Looks much better. I can see that the record count matches the syslog generator packet count.

Compared to the original table created by Telegraf, did I give up any read performance in the long run by changing the table definition?

Was this all happening because of the hash calculation? Does telegraf or CrateDB calculate the hash?

Thanks

Was this all happening because of the hash calculation? Does telegraf or CrateDB calculate the hash?

The hash calculation is done by the CrateDB Telegraf plugin and I would say it typically makes sense.

According to Influx

Tags: Key/Value string pairs and usually used to identify the metric.

i.e. the combination of name, tags, and timestamp should be unique.
Not all Telegraf inputs guarantee exactly-once delivery. For example, an input delivering these messages (note the duplicate):

{ "name" : "a", "tags" : { "tag_a" = "a", "tag_b" ="abc" }, "timestamp" : 1662131469 } 
{ "name" : "a", "tags" : { "tag_a" = "a", "tag_b" ="abc" }, "timestamp" : 1662131469 } 
{ "name" : "a", "tags" : { "tag_c" = "c" }, "timestamp" : 1662131469 } 

should lead to the following entries in CrateDB:

{ "name" : "a", "tags" : { "tag_a" = "a", "tag_b" ="abc" }, "timestamp" : 1662131469 } 
{ "name" : "a", "tags" : { "tag_c" = "c" }, "timestamp" : 1662131469 } 

By removing the primary key from my CREATE TABLE statement, did I introduce more risk of duplicate records in my table?

I saw in the other post that you didn't define any tags:

       data_format = "json"
       json_string_fields = ["SOURCE","PROGRAM","PRIORITY","MESSAGE","LEGACY_MSGHDR","HOST_FROM","HOST","FACILITY"]
       json_time_key = "DATE"
       json_time_format = "2006-01-02T15:04:05Z07:00"

Some of the json_string_fields should probably be tag_keys, and json_name_key should also be set; see the sketch below.
Then you could use the primary key again and avoid any duplicates.
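Something along these lines, reusing the field names from your config; which fields make good tags is of course your call, and json_name_key = "PROGRAM" is just an example:

       data_format = "json"
       # promote identifying fields to tags so they become part of the hash / primary key
       tag_keys = ["HOST", "FACILITY", "PRIORITY"]
       # free-text fields stay as normal string fields
       json_string_fields = ["SOURCE", "MESSAGE", "LEGACY_MSGHDR", "HOST_FROM"]
       # take the metric name from the PROGRAM key instead of the input plugin name
       json_name_key = "PROGRAM"
       json_time_key = "DATE"
       json_time_format = "2006-01-02T15:04:05Z07:00"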

Thanks, I will fix that.

Quick question: the “MESSAGE” field contains the syslog message itself. If I ingest it as a field, will the field be indexed?

I am asking because we may need keyword search capability to locate relevant log messages.