Error running periodic ANALYZE

Hello,

I am currently running a 3-node CrateDB cluster (on Kubernetes), version 4.6.6. I have started monitoring CPU usage with Prometheus, and every day I can see a CPU consumption spike between 16:30 and 16:40 on one of the pods. The pod in question is the one marked as master in the CrateDB management web interface. Looking at the logs of that pod, I can see the following exception:

[2022-10-18T16:40:14,239][ERROR][i.c.s.TableStatsService  ] [crate-bemp-local-1] Error running periodic ANALYZE
java.util.concurrent.CompletionException: org.elasticsearch.transport.RemoteTransportException: [crate-bemp-local-1][10.2.4.193:4300][internal:crate:sql/analyze/fetch_samples]
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?]
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346) ~[?:?]
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:632) ~[?:?]
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2158) ~[?:?]
	at io.crate.action.FutureActionListener.onFailure(FutureActionListener.java:52) ~[crate-server.jar:?]
	at io.crate.execution.support.MultiActionListener.countdown(MultiActionListener.java:85) ~[crate-server.jar:?]
	at io.crate.execution.support.MultiActionListener.onResponse(MultiActionListener.java:70) ~[crate-server.jar:?]
	at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:57) ~[crate-server.jar:?]
	at org.elasticsearch.transport.TransportService$TimeoutResponseHandler.handleResponse(TransportService.java:1040) ~[crate-server.jar:?]
	at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:224) ~[crate-server.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[crate-server.jar:?]
	at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutor.execute(EsExecutors.java:161) ~[crate-server.jar:?]
	at org.elasticsearch.transport.InboundHandler.handleResponse(InboundHandler.java:216) ~[crate-server.jar:?]
	at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:135) ~[crate-server.jar:?]
	at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:104) ~[crate-server.jar:?]
	at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:704) ~[crate-server.jar:?]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:53) ~[crate-server.jar:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[netty-transport-4.1.65.Final.jar:4.1.65.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[netty-transport-4.1.65.Final.jar:4.1.65.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[netty-transport-4.1.65.Final.jar:4.1.65.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324) ~[netty-codec-4.1.65.Final.jar:4.1.65.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:296) ~[netty-codec-4.1.65.Final.jar:4.1.65.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[netty-transport-4.1.65.Final.jar:4.1.65.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[netty-transport-4.1.65.Final.jar:4.1.65.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[netty-transport-4.1.65.Final.jar:4.1.65.Final]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:271) ~[netty-handler-4.1.65.Final.jar:4.1.65.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[netty-transport-4.1.65.Final.jar:4.1.65.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[netty-transport-4.1.65.Final.jar:4.1.65.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[netty-transport-4.1.65.Final.jar:4.1.65.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[netty-transport-4.1.65.Final.jar:4.1.65.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[netty-transport-4.1.65.Final.jar:4.1.65.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[netty-transport-4.1.65.Final.jar:4.1.65.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[netty-transport-4.1.65.Final.jar:4.1.65.Final]
	at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795) ~[netty-transport-native-epoll-4.1.65.Final-linux-x86_64.jar:4.1.65.Final]
	at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480) ~[netty-transport-native-epoll-4.1.65.Final-linux-x86_64.jar:4.1.65.Final]
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) ~[netty-transport-native-epoll-4.1.65.Final-linux-x86_64.jar:4.1.65.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[netty-common-4.1.65.Final.jar:4.1.65.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.65.Final.jar:4.1.65.Final]
	at java.lang.Thread.run(Thread.java:831) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [crate-bemp-local-1][10.2.4.193:4300][internal:crate:sql/analyze/fetch_samples]
Caused by: java.lang.IllegalStateException: unexpected docvalues type NONE for field 'exception' (expected one of [SORTED, SORTED_SET]). Re-index with correct docvalues type.
	at org.apache.lucene.index.DocValues.checkField(DocValues.java:317) ~[lucene-core-8.8.2.jar:8.8.2 a92a05e195b775b30ca410bc0a26e8e79e7b3bfb - mdrob - 2021-04-06 16:33:27]
	at org.apache.lucene.index.DocValues.getSortedSet(DocValues.java:410) ~[lucene-core-8.8.2.jar:8.8.2 a92a05e195b775b30ca410bc0a26e8e79e7b3bfb - mdrob - 2021-04-06 16:33:27]
	at io.crate.expression.reference.doc.lucene.BytesRefColumnReference.setNextReader(BytesRefColumnReference.java:69) ~[crate-server.jar:?]
	at io.crate.statistics.ReservoirSampler$DocIdToRow.apply(ReservoirSampler.java:255) ~[crate-server.jar:?]
	at io.crate.statistics.ReservoirSampler.getSamples(ReservoirSampler.java:215) ~[crate-server.jar:?]
	at io.crate.statistics.ReservoirSampler.getSamples(ReservoirSampler.java:126) ~[crate-server.jar:?]
	at io.crate.statistics.TransportAnalyzeAction.lambda$new$1(TransportAnalyzeAction.java:114) ~[crate-server.jar:?]
	at io.crate.execution.support.NodeActionRequestHandler.messageReceived(NodeActionRequestHandler.java:49) [crate-server.jar:?]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) [crate-server.jar:?]
	at org.elasticsearch.transport.TransportService$6.doRun(TransportService.java:698) [crate-server.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [crate-server.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
	... 1 more

I think the CPU consumption returns to normal just after this message. Here is the CPU consumption chart (the unit is millicores):

Should I be worried about this error? Is it normal? As mentioned, it happens every day.

Best regards,

Hi @iames,

it looks like the high load is generated by the periodic statistics collection (configured via stats.service.interval, see Cluster-wide settings — CrateDB: Reference).
In CrateDB 5.1.0 (release notes; still in the testing phase) we added a throttling mechanism to reduce such load spikes.
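If the daily spike itself is disruptive, the collection can also be spread out by raising the interval. A minimal sketch (the 72-hour value is only an example, not a recommendation):

```sql
-- Inspect the current interval (the default is 24h):
SELECT settings['stats']['service']['interval'] FROM sys.cluster;

-- Run the periodic ANALYZE less often, e.g. every 72 hours:
SET GLOBAL PERSISTENT "stats.service.interval" = '72h';
```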

Nevertheless, collecting statistics shouldn’t cause an exception. Can you identify which table triggers this behavior and share its schema (a table that has a column named exception)?
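To narrow it down, running the statistics collection manually should reproduce the failure on demand instead of waiting for the daily run:

```sql
-- ANALYZE triggers the same sampling as the periodic job;
-- if it fails, the error should surface immediately in the client
-- and in the node logs, naming the offending field:
ANALYZE;
```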

Upgrading CrateDB could also help. CrateDB 4.6 uses Lucene 8.8.2, while CrateDB 5.1 uses Lucene 9.2.0. As the error is Lucene-related, it might have been resolved upstream in Lucene.

Sorry for the delay. Here is the suspect table:

CREATE TABLE IF NOT EXISTS "bemp"."api_log" (
   "timestamp" TIMESTAMP WITHOUT TIME ZONE,
   "message" TEXT,
   "method" TEXT,
   "headers" TEXT,
   "url" TEXT,
   "endpoint" TEXT,
   "client_id" BIGINT,
   "user_agent" TEXT,
   "remote_host" TEXT,
   "resp_status" INTEGER,
   "resp_mimetype" TEXT,
   "resp_data" TEXT,
   "exception" TEXT,
   "affected_rows" INTEGER,
   "query_time" REAL,
   "token_id" TEXT,
   "data" TEXT,
   "user_id" BIGINT,
   "modification_date" TIMESTAMP WITHOUT TIME ZONE GENERATED ALWAYS AS _cast(current_timestamp, 'timestamp without time zone'),
   "creation_date" TIMESTAMP WITHOUT TIME ZONE DEFAULT _cast(current_timestamp, 'timestamp without time zone') NOT NULL
)
CLUSTERED INTO 8 SHARDS
WITH (
   "allocation.max_retries" = 5,
   "blocks.metadata" = false,
   "blocks.read" = false,
   "blocks.read_only" = false,
   "blocks.read_only_allow_delete" = false,
   "blocks.write" = false,
   codec = 'default',
   column_policy = 'dynamic',
   "mapping.total_fields.limit" = 1000,
   max_ngram_diff = 1,
   max_shingle_diff = 3,
   number_of_replicas = '0-1',
   "routing.allocation.enable" = 'all',
   "routing.allocation.total_shards_per_node" = -1,
   "store.type" = 'fs',
   "translog.durability" = 'REQUEST',
   "translog.flush_threshold_size" = 536870912,
   "translog.sync_interval" = 5000,
   "unassigned.node_left.delayed_timeout" = 60000,
   "write.wait_for_active_shards" = '1'
)

And here are the stats from the UI:

I have re-created the table and imported all the data using INSERT INTO … SELECT. I will check whether it fails again today. Upgrading to the latest CrateDB version is on our roadmap, but it will not be possible for a while.
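For reference, this is roughly how I did the rebuild; api_log_new is just a placeholder name, and the new table uses the same schema as the CREATE TABLE statement above:

```sql
-- Create an empty copy using the column definitions shown above:
-- CREATE TABLE bemp.api_log_new ( ... same columns as bemp.api_log ... );

-- Copy the data into the fresh table:
INSERT INTO bemp.api_log_new SELECT * FROM bemp.api_log;

-- Swap the table names atomically, then drop the old (broken) copy,
-- which after the swap lives under the placeholder name:
ALTER CLUSTER SWAP TABLE bemp.api_log_new TO bemp.api_log;
DROP TABLE bemp.api_log_new;
```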

Best regards,

After re-indexing the table and running ANALYZE manually, the error is gone: no error during yesterday's run, nor when executing it by hand.
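For anyone hitting the same issue: one way to confirm that statistics were collected successfully afterwards is to query pg_stats (schema and column names here are the ones from this thread):

```sql
-- Statistics rows should now exist for the previously failing column:
SELECT tablename, attname, null_frac, n_distinct
FROM pg_catalog.pg_stats
WHERE schemaname = 'bemp'
  AND tablename = 'api_log'
  AND attname = 'exception';
```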

Best regards,
