Incorrect result with TRY_CAST function

I am using crate.io version 4.7.0.

I have created table definition as

create table myorder (
	id long,
	ordervalue long,
	quantity long,
	quantityStr string,
	name string,
  primary key (id)
);

Note: quantityStr is column with type string.

Now let’s say there are 2 records with values for quantityStr column as:

quantityStr 
10 
89

Now, lets execute query select * from myorder where TRY_CAST(quantitystr AS bigint) > 3;
My assumption is that since we are casting quantitystr AS bigint, it should do numeric evaluation.

But output for the above query returns only 1 record having value 89 for quantityStr column
i.e. it doesn’t select record with value 10 for quantityStr column.

It seems TRY_CAST is doing character by character comparison of numeric string.

This is used to work with 3.2.7 version.

Could you please provide any info for this behaviour?
Is it a bug introduced in the version after 3.2.7 version.

Appreciate any update/feedback/suggestion.

2 Likes

This indeed seems to be a bug and related how numeric queries on text values are processed. In this case CrateDB does not just do a simple full table scan converting all quantitystr fields to numeric values in an intermediate result, but rather tries to establish a filter, so that the text index can be used.

One can see that behaviour when using EXPLAIN ANALYZE

explain analyze SELECT * FROM myorder
WHERE TRY_CAST(quantitystr AS bigint) > 2 limit 100;

which results in (simplfied)

{
  "Execute": {
    "Phases": {
      "0-collect": { "nodes": { "5aiwWNGYTi68j9XCA-qXlA": 1.112539 } },
      "1-mergeOnHandler": { "nodes": { "EKMSnr-9R0GfWu_n0-L1qg": 3.340537 } },
      "2-fetchPhase": { "nodes": { "5aiwWNGYTi68j9XCA-qXlA": 3.110859 } }
    },
    "5aiwWNGYTi68j9XCA-qXlA": {
      "QueryBreakdown": [
        {
          "QueryDescription": "quantitystr:{2 TO *}",
          "QueryName": "MultiTermQueryConstantScoreWrapper"
        },
        {
          "QueryDescription": "quantitystr:20 quantitystr:89",
          "QueryName": "BooleanQuery"
        }
      ]
    }
  }
}

The way numeric values are indexed within CrateDB using BKD-Trees nowadays, while in the past also inverted (text-based) indexes where used with padded 0.


Generally speaking I’d consider using text indexes for numeric lookups nowadays bad practices. I will however raise a issue in crate/crate on GitHub.

You might just want to do the parsing on insert with a generating function

create table myorder (
	id long,
	ordervalue long,
	quantity long,
	quantityStr string,
    quantityStr_parsed long GENERATED ALWAYS AS try_cast(quantityStr as long),
	name string,
  primary key (id)
);
1 Like

Thanks a lot for quick response. Really appreciate it.

  1. Could you please share the link for the issue that you are going to raise in GitHub?
  2. it seem issue is there on 4.6.x also. As noted in the original thread, this works with 3.2.7
  3. Is there any workaround possible for dealing with existing data?

Thanks for all your suggestions/thoughts…

  1. Could you please share the link for the issue that you are going to raise in GitHub?
  2. it seem issue is there on 4.6.x also. As noted in the original thread, this works with 3.2.7
    • I am not 100% certain, but this could related to the change in indexing
  3. Is there any workaround possible for dealing with existing data?
    • how big is the dataset (num_docs / size) ?

Thanks a lot for the reply.

Data size is around 1 billion records.

A rather expensive workaround (table scan) would be an altering view (e.g. adding 0)

CREATE OR REPLACE VIEW v_myorder AS 
SELECT quantity, try_cast(quantitystr as long)+0 AS quantitystr_long
FROM "doc"."myorder";

or also rather expensive with a UDF

CREATE FUNCTION text_gt (txt_val TEXT, num LONG)
RETURNS BOOLEAN
LANGUAGE JAVASCRIPT
AS 'function text_gt(txt_val, num) {
    return Number(txt_val) > num
}';
SELECT * FROM myorder
WHERE text_gt(quantitystr,3);

though I would not necessarily recommend that.


The other options

  • reindex
  • wait for fix

Thanks for your suggestions.

I also had come up with quick fix similar to that only…

CREATE or replace FUNCTION my_try_cast( input string   )
RETURNS bigint
LANGUAGE JAVASCRIPT
AS 'function my_try_cast(input) {
  if  (isNaN(input))  { //This will handle mix data (numeric or non numeric)
    return null 
  } else {
    return Number(input)
  }
}';

Will compare performance with your approach also.

Thanks a lot for all your suggestions…

1 Like