Generate "word cloud" from textual data

Is it possible to generate a word cloud (Tag cloud - Wikipedia) from a text column in CrateDB?

So let’s say that I want to store news data in a table in CrateDB, with the news body going into the body column and the news title going into the title column. Is it possible to generate a word cloud out of the body column?

It would be far easier to use a word cloud library/module in your favorite language than to use SQL. You could probably write a function that counts word frequency per body and writes the counts into a temp table, then return the top 100 rows of that temp table as the “most frequent words counted”.
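To illustrate the client-side counting idea, here is a minimal sketch in Python. It assumes the body values have already been fetched from the table (for example in batches) and are passed in as plain strings. The function name `top_words`, the tiny stop-word list, and the sample sentences are made up for illustration and are not part of CrateDB or of any word cloud library.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "as"}

# Very simple tokenizer: runs of ASCII letters only (an assumption, not a
# replacement for a proper language-aware analyzer).
WORD_RE = re.compile(r"[a-z]+")


def top_words(bodies: Iterable[str], n: int = 100) -> List[Tuple[str, int]]:
    """Count word frequency across all bodies and return the n most frequent."""
    counts = Counter()
    for body in bodies:
        words = WORD_RE.findall(body.lower())
        # Update the running totals one body at a time, so the input can be
        # a generator that streams rows instead of a list held in memory.
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


# Made-up sample data standing in for rows of the body column.
sample = [
    "The market rallied as the bank cut rates",
    "Rates fell and the market cheered",
]
print(top_words(sample, n=2))
```

The resulting (word, count) pairs are the kind of input most word cloud libraries accept as frequencies. Because the counter is updated per body, the same function should work on a cursor that yields rows in batches, although this has not been measured against millions of articles.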

Thank you @sampope,

The thing is, we have a huge number of news articles, in the millions, and processing them with a library would be very slow.

CrateDB is based on Lucene and also has some bits from Elasticsearch, so I thought there might be a built-in function to generate a word cloud from text fields, as this is a common and straightforward request in Elasticsearch. See this link for example: python - How to generate a word cloud using elasticsearch? - Stack Overflow

@optimusprime

Unfortunately, I am not aware of any capability in CrateDB similar to terms aggregations (yet).
Although that might be possible with the underlying Lucene indexing, the query execution engine of CrateDB is totally different and separate from Elasticsearch.

If you think it would be good to add such a feature, feel free to create a feature request in the crate-repo.

Thank you @proddata,

I will add the feature request.
