Fuzzy Search & Synonyms

Todd_Bryant · November 17, 2021, 2:25pm

I have a table with firstname and lastname columns. I want to find matches that tolerate minor transpositions, such as lastname = 'bronw' (actually 'brown'). I would also like to use a synonym file to find nicknames and diminutives, such as William = Bill. The columns have been defined as fulltext. Can you provide any assistance or a reference? Thank you!

Michael_Kleen · November 18, 2021, 1:15pm

Hello Todd,

A common approach in information retrieval is to use n-grams for fuzzy search or spelling correction. Each word is split into overlapping tokens of n characters (n-grams).

For example, an n-gram tokenizer with min_gram=2 and max_gram=2 splits a word into tokens of exactly length 2:

'brown' -> ['br', 'ro', 'ow', 'wn']

The same approach is applied on the query term:

'bronw' -> ['br', 'ro', 'on', 'nw']

Then the overlap between the two token sets is calculated and scored with a heuristic; CrateDB uses Okapi BM25. If you increase the length of the n-grams, the matching becomes less fuzzy.
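
To make the overlap idea concrete, here is a small Python sketch that splits terms into n-grams and lists the ones they share with the query. It only illustrates the tokenization and the overlap; it does not reproduce the BM25 scoring, and the helper names `ngrams` and `shared_ngrams` are made up for this illustration.

```python
def ngrams(word, n=2):
    """Return the overlapping character n-grams of a word, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def shared_ngrams(indexed, query, n=2):
    """Return the n-grams an indexed term and a query term have in common."""
    return sorted(set(ngrams(indexed, n)) & set(ngrams(query, n)))


# Compare the misspelled query against a few indexed last names,
# first with 2-grams and then with 3-grams.
for n in (2, 3):
    for name in ("brown", "browne", "braun"):
        print(n, name, shared_ngrams(name, "bronw", n))
```

With 2-grams, 'brown' and 'bronw' still share 'br' and 'ro'. With 3-grams, only 'bro' remains in common, and 'braun' no longer shares anything with the query, which is why longer n-grams match less fuzzily.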

Here is a fully working example:

CREATE ANALYZER a1 (TOKENIZER t1 with (type='ngram', min_gram=2, max_gram=2, token_chars=['letter']));
CREATE TABLE doc.test(firstname varchar, lastname varchar, INDEX lastname_ft USING FULLTEXT (lastname) WITH (analyzer = 'a1'));
INSERT INTO doc.test (firstname, lastname) values ('charly', 'brown'), ('charly', 'braun'), ('charly', 'browne');  
SELECT firstname, lastname, _score FROM doc.test WHERE MATCH(lastname_ft, 'bronw') ORDER BY _score DESC;
+-----------+----------+------------+
| firstname | lastname |     _score |
+-----------+----------+------------+
| charly    | brown    | 0.17363958 |
| charly    | browne   | 0.1585405  |
| charly    | braun    | 0.13076457 |
+-----------+----------+------------+

Another approach to fuzzy matching is full-text search with the fuzziness parameter. Internally, this uses the Levenshtein distance to calculate the match. This approach is CPU intensive and should only be used for smaller datasets. On the other hand, n-grams will increase your storage size.
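
For intuition about what an edit distance measures, here is a minimal Python sketch of the classic Levenshtein distance (insertions, deletions, and substitutions only). It is a textbook implementation for illustration, not the code CrateDB runs internally, and the function name `levenshtein` is my own.

```python
def levenshtein(a, b):
    """Edit distance counting single-character insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        previous = current
    return previous[-1]


for name in ("brown", "browne", "braun"):
    print(name, levenshtein("bronw", name))
```

Note that under this plain definition the swapped letters in 'bronw' count as two edits relative to 'brown', because a transposition is not a single operation here.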

CrateDB does support synonym files. The synonym file needs to be placed in the config folder and must be in the Solr or WordNet synonym file format.

Here is a fully working example:

config/synonyms.txt

William => Bill

CREATE ANALYZER a2 (TOKENIZER lowercase, TOKEN_FILTERS (my_synonyms WITH (type='synonym', synonyms_path='synonyms.txt')));
CREATE TABLE doc.test(name varchar, INDEX synonym_ft USING FULLTEXT (name) WITH (analyzer = 'a2'));
INSERT INTO doc.test (name) values ('Bill');
SELECT name FROM doc.test WHERE MATCH(synonym_ft, 'William');
+------+
| name |
+------+
| Bill |
+------+
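
To see why a query for 'William' finds the row containing 'Bill', here is a rough Python sketch of how Solr-style synonym rules can be read and applied to a token. It is a simplified illustration and not CrateDB's or Lucene's actual parser; `parse_synonyms` and `expand` are hypothetical helpers, and the 'robert, bob' line is an invented example of the comma-separated form.

```python
def parse_synonyms(text):
    """Parse Solr-style rules into {lowercased term: set of replacement terms}."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: every term on the left is replaced by the right side.
            left, right = line.split("=>", 1)
            targets = {t.strip().lower() for t in right.split(",")}
            for source in left.split(","):
                rules.setdefault(source.strip().lower(), set()).update(targets)
        else:
            # Equivalent terms: each term expands to the whole group.
            group = {t.strip().lower() for t in line.split(",")}
            for term in group:
                rules.setdefault(term, set()).update(group)
    return rules


def expand(token, rules):
    """Return the terms a (lowercased) token is mapped to, or the token itself."""
    token = token.lower()
    return rules.get(token, {token})


rules = parse_synonyms("William => Bill\nrobert, bob")  # second line is hypothetical
print(expand("William", rules), expand("Bill", rules))
```

Both the query term 'William' and the stored value 'Bill' end up as the same token, so they match.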

Best Regards,

Michael

Todd_Bryant · November 18, 2021, 7:20pm

Michael,
Thank you very much. Can you provide an example that combines tokenizers?

Thanks!

Michael_Kleen · November 19, 2021, 2:10pm

Hi Todd,

You cannot combine tokenizers in an analyzer, but you can combine multiple token filters, which are applied in the order listed:

CREATE ANALYZER a1 (TOKENIZER standard, TOKEN_FILTERS (lowercase, asciifolding, my_synonyms WITH (type='synonym', synonyms_path='synonyms.txt')));
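
Conceptually, the analyzer above runs one tokenizer and then passes the tokens through each filter in turn. The Python sketch below mimics that chain with simplified stand-ins; the regex tokenizer, the accent stripping, and the `SYNONYMS` dictionary are assumptions for illustration and are much cruder than the real standard tokenizer and filters.

```python
import re
import unicodedata

# Hypothetical synonym rules, standing in for the contents of synonyms.txt.
SYNONYMS = {"william": "bill"}


def standard_tokenize(text):
    """Very rough stand-in for the standard tokenizer: split on non-word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    return [t.lower() for t in tokens]


def ascii_fold(tokens):
    """Strip accents by decomposing characters and dropping combining marks."""
    folded = []
    for token in tokens:
        decomposed = unicodedata.normalize("NFKD", token)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return folded


def apply_synonyms(tokens):
    return [SYNONYMS.get(t, t) for t in tokens]


def analyze(text):
    """One tokenizer, then the token filters in the order they are listed."""
    tokens = standard_tokenize(text)
    for token_filter in (lowercase, ascii_fold, apply_synonyms):
        tokens = token_filter(tokens)
    return tokens


print(analyze("Wílliam Brown"))
```

The order matters: in this sketch the synonym lookup only succeeds because lowercasing and folding have already normalized the token to 'william'.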

Best Regards,

Michael
