Cratedb error when using 'SELECT DISTINCT ON'

Hi,

I was fascinated by Cratedb and I’m trying it (as a complete newbie) to see if it’s possible to use it instead of postgresql, especially with time series.

I have a query that works perfectly in Postgresql but when run on CrateDB it throws an error.

SELECT DISTINCT ON (device_id) device_id, status, utime
FROM Mydb.logs
WHERE utime BETWEEN ‘2022-01-01 00:00:00’ and ‘2022-01-02 23:59:59’

‘utime’ is the datetime as CrateDB TIMESTAMP type.

The error is :
SQLParseException[line 1:17: no viable alternative at input 'SELECT DISTINCT ON']

CrateDB or query syntax issue ?

Thanks for your help.

Luca

Hi Luca,
CrateDB does not currently support DISTINCT ON , but you can achieve the same results with a query like:

SELECT t.device_id,t.status,t.utime
FROM (
	SELECT DISTINCT device_id,status,utime
		,row_number() OVER (
			PARTITION BY device_id 
			ORDER BY device_id,status,utime
			) AS rn
	FROM Mydb.logs
	WHERE utime BETWEEN '2022-01-01 00:00:00' AND '2022-01-02 23:59:59'
	) AS t
WHERE rn = 1
ORDER BY device_id,status,utime

Please note you can change the ORDER BY to meet your needs.
I hope this helps.

1 Like

Thank you very much for the solution and for the speed with which you replied !!
I’m liking CrateDB more and more, thanks also to the community.

Sorry if I take advantage of your courtesy, would you be kind enough to tell me how to modify your solution to locate the record with the timestamp closest to the second date or closest to a single specific date, without defining a time frame via BETWEEN … AND …) ?

A thousand thanks.

Luca

No worries at all, and apologies I got busy with something else and it took me a while to come back to you.

To get, for each device_id, the record with the utime closest to the upper boundary of the filter on utime, we could modify the ORDER BY in the row_number() function to ORDER BY utime desc

Starting in CrateDB 5.2.0 (which is currently in testing - not ready for production use - docker image crate/crate:5.2.0) there are new functions available (max_by and min_by) which allow to simplify the query to:

SELECT device_id,max_by(status,utime),max(utime)
FROM Mydb.logs
WHERE utime BETWEEN '2022-01-01 00:00:00' AND '2022-01-02 23:59:59'
GROUP BY device_id
ORDER BY 1,2,3;

Regarding distance to a specific point in time, would you be looking for the closest record regardless of it being before or after the point?

1 Like

Hi hernanc,
Thanks for your help.

Regarding the distance to a specific point in time the ideal would be to have a query that returns the closest record before a specific point in time.

Hi,
I think a query like this may be what you are looking for:

WITH devices as (SELECT DISTINCT device_id FROM Mydb.logs),
	latest_records_per_device AS (
		SELECT device_id, (
			SELECT _id 
			FROM Mydb.logs 
			WHERE logs.device_id=devices.device_id 
			AND logs.utime<=('2022-01-01 03:00:00'::TIMESTAMP) 
			ORDER BY logs.utime DESC 
			LIMIT 1) AS latestid
		FROM devices)
SELECT logs.device_id,logs.status,logs.utime
FROM latest_records_per_device
LEFT JOIN Mydb.logs on logs._id=latestid
ORDER BY 1,2,3;

Hi @hernanc,

thanks for helping.

Unfortunately when I run this query CrateDB returns nothing but the cluster load skyrockets.

I am sorry to hear that. One thing I am not sure I understood, were you looking to combine two different filters on time intervals? (one to narrow down the number of devices and a different one to lookup the statuses), because if utime being behind the special point in time is the only filter, I think you could use the query with the row_number and the ORDER BY utime desc

Also the new version with the max_by function should be only a few days away now, and using this new function, besides making queries easier, it is also a lot faster in all my tests, so you may want to wait or test in the meanwhile with the version from the testing channel, and use a query with max_by like:

SELECT device_id, max_by(status,utime) as LatestStatus,max(utime) as LatestUTime
FROM Mydb.logs 
WHERE logs.utime<=('2022-01-01 03:00:00'::TIMESTAMP) 
GROUP BY device_id
ORDER BY 1,2,3;
1 Like

Hi @hernanc, I’ll wait for new version to test the query.
Thx.

Luca

1 Like

Hi @hernanc,
built 5.2 docker image and I’ve tested your query.
It works very well , thank you very much.

Luca

1 Like