When trying to import a 2.6G file in v5, with the heap set as high as 3g, it would run into an OutOfMemoryError. Raising the heap to 6g, it failed on a circuit breaker instead. Yet I can import the same file in v4.5.4 with only a 500m heap.
I discovered the issue: when the file contains a parse error, v5 gobbles up memory like crazy. v4 correctly ends with a parser error in the RETURN SUMMARY output, even with only a 500m heap. Once I fixed the parse error, I was able to import with a 1g heap in v5, and with a 500m heap in v4. I did not try lower than 1g with v5.
Thank you for reporting this.
I would like to attempt to reproduce it. Was it 5.3.2 where you observed this? With a CSV or a JSON Lines file? Was it compressed?
I just had Docker use crate:5.3, so I presume it was the latest 5.3.x on Docker Hub as of yesterday.
It was a JSON file, plain text, human-readable, with one entry per line. Without this error, which sed introduced into the name of one property (the same property on each line), it imports just fine. The incorrect property key looked something like `"dd_id: ` with the closing `"` after `dd_id` missing; it should have been `"dd_id": `.
Not sure what you mean by compressed, but the file was not pretty-printed; it was one line per entry. One entry can have a lot of properties, including objects inside objects. I wouldn't try to import a file that I had run `jq .` on.
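For concreteness, here is roughly what that sed-damaged property does to a JSON parser. The key `dd_id` is from this thread; the value and the sibling property are made up for illustration:

```python
import json

# One correct entry and the same entry with the closing " after dd_id missing.
good = '{"dd_id": "abc123", "level": "INFO"}'
bad = '{"dd_id: "abc123", "level": "INFO"}'

json.loads(good)  # parses fine

# The damaged key swallows everything up to the next quote,
# so the parser fails on the stray token that follows.
try:
    json.loads(bad)
except json.JSONDecodeError as exc:
    print("parse error:", exc)
```

Each such line would be one parse error in the RETURN SUMMARY output in v4, but in v5 it apparently triggers the memory blow-up described above.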
If you are familiar with it, this was produced by DataDog after converting Java log files. Our pre-DataDog logs include a lot of Marker/JSON structured logging, so many entries have some depth in terms of objects inside objects.
It turns out this issue is already being tracked under
[Sample errors for COPY FROM RETURN SUMMARY · Issue #14133 · crate/crate](https://github.com/crate/crate/issues/14133)
Could you upvote or comment on that issue?