Very Slow Query Performance

I’ve got a query that shows new users vs. returning users. It ran quickly during development (under 500 ms), but once we started using production levels of data, execution time has spiraled out of control and now averages 20 seconds for only one month’s worth of data.

Interestingly, even if I omit the UNION (which lets me compare with the prior time period), it still takes about 19 seconds. The query plan shows it is using hash joins, which should be quick.

SQL Query:

SELECT
  min(visits.event_time) as date,
  coalesce(sum(visits.sess_count) FILTER (
    WHERE
      visits.sess_count = 1
  ), 0) AS new_vis,
  coalesce(avg(visits.sess_len) FILTER (
    WHERE
      visits.sess_count = 1
  ), 0) AS new_avg,
  coalesce(sum(visits.sess_count) FILTER (
    WHERE
      visits.sess_count > 1
  ), 0) AS ret_vis,
  coalesce(avg(visits.sess_len) FILTER (
    WHERE
      visits.sess_count > 1
  ), 0) AS ret_avg,
  coalesce(sum(visits.sess_count), 0) AS tot_vis,
  coalesce(avg(visits.sess_len), 0) AS tot_avg
FROM
  (
    SELECT
      max(joined_vis.totaltime) AS sess_len,
      count(joined_vis.session_id) AS sess_count,
      min(joined_vis.event_time) as event_time
    FROM
      (
        SELECT
          af_dev.pageview.totaltime AS totaltime,
          af_dev.user_session.user_id AS user_id,
          af_dev.user_session.session_id AS session_id,
          af_dev.pageview.event_time AS event_time,
          af_dev.user_session.domain AS domain,
          af_dev.user_session.hostname AS hostname
        FROM
          af_dev.pageview
          JOIN af_dev.user_session ON af_dev.pageview.user_id = af_dev.user_session.user_id
        WHERE
          af_dev.user_session.domain = 'domain.com'
          AND af_dev.user_session.hostname = 'www.domain.com'
          AND af_dev.user_session.event_time BETWEEN '2022-12-05' AND '2023-01-05'
          AND af_dev.pageview.event_time BETWEEN '2022-12-05' AND '2023-01-05'
          AND af_dev.pageview.domain = 'domain.com'
          AND af_dev.pageview.host = 'www.domain.com'
      ) AS joined_vis
    GROUP BY
      joined_vis.user_id,
      joined_vis.session_id
  ) AS visits
UNION
SELECT
  min(p_visits.event_time) as date,
  sum(p_visits.sess_count) FILTER (
    WHERE
      p_visits.sess_count = 1
  ) AS new_vis,
  avg(p_visits.sess_len) FILTER (
    WHERE
      p_visits.sess_count = 1
  ) AS new_avg,
  sum(p_visits.sess_count) FILTER (
    WHERE
      p_visits.sess_count > 1
  ) AS ret_vis,
  avg(p_visits.sess_len) FILTER (
    WHERE
      p_visits.sess_count > 1
  ) AS ret_avg,
  sum(p_visits.sess_count) AS tot_vis,
  avg(p_visits.sess_len) AS tot_avg
FROM
  (
    SELECT
      max(joined_vis.totaltime) AS sess_len,
      count(joined_vis.session_id) AS sess_count,
      min(joined_vis.event_time) as event_time
    FROM
      (
        SELECT
          af_dev.pageview.totaltime AS totaltime,
          af_dev.user_session.user_id AS user_id,
          af_dev.user_session.session_id AS session_id,
          af_dev.pageview.event_time AS event_time,
          af_dev.user_session.domain AS domain,
          af_dev.user_session.hostname AS hostname
        FROM
          af_dev.pageview
          JOIN af_dev.user_session ON af_dev.pageview.user_id = af_dev.user_session.user_id
        WHERE
          af_dev.user_session.domain = 'domain.com'
          AND af_dev.user_session.hostname = 'www.domain.com'
          AND af_dev.user_session.event_time BETWEEN '2022-11-05' AND '2022-12-05'
          AND af_dev.pageview.event_time BETWEEN '2022-11-05' AND '2022-12-05'
          AND af_dev.pageview.domain = 'domain.com'
          AND af_dev.pageview.host = 'www.domain.com'
      ) AS joined_vis
    GROUP BY
      joined_vis.user_id,
      joined_vis.session_id
  ) AS p_visits;

SQL Explain:

GroupHashAggregate[date, new_vis, new_avg, ret_vis, ret_avg, tot_vis, tot_avg]
  └ Union[date, ret_vis, ret_avg, new_vis, tot_vis, tot_avg, new_avg]
    ├ Eval[min(event_time) AS date, coalesce(sum(sess_count) FILTER (WHERE (sess_count > 1::bigint)), 0::bigint) AS ret_vis, coalesce(avg(sess_len) FILTER (WHERE (sess_count > 1::bigint)), 0.0) AS ret_avg, coalesce(sum(sess_count) FILTER (WHERE (sess_count = 1::bigint)), 0::bigint) AS new_vis, coalesce(sum(sess_count), 0::bigint) AS tot_vis, coalesce(avg(sess_len), 0.0) AS tot_avg, coalesce(avg(sess_len) FILTER (WHERE (sess_count = 1::bigint)), 0.0) AS new_avg]
    │  └ HashAggregate[min(event_time), sum(sess_count) FILTER (WHERE (sess_count > 1::bigint)), avg(sess_len) FILTER (WHERE (sess_count > 1::bigint)), sum(sess_count) FILTER (WHERE (sess_count = 1::bigint)), sum(sess_count), avg(sess_len), avg(sess_len) FILTER (WHERE (sess_count = 1::bigint))]
    │    └ Rename[event_time, sess_count, sess_len] AS visits
    │      └ Eval[min(event_time) AS event_time, count(session_id) AS sess_count, max(totaltime) AS sess_len]
    │        └ GroupHashAggregate[user_id, session_id | min(event_time), count(session_id), max(totaltime)]
    │          └ Rename[session_id, user_id, event_time, totaltime] AS joined_vis
    │            └ Eval[session_id AS session_id, user_id AS user_id, event_time AS event_time, totaltime AS totaltime]
    │              └ HashJoin[(user_id = user_id)]
    │                ├ Collect[af_dev.pageview | [totaltime, event_time, user_id] | ((((event_time >= 1670198400000::bigint) AND (event_time <= 1672876800000::bigint)) AND (host = 'www.domain.com')) AND (domain = 'domain.com'))]
    │                └ Collect[af_dev.user_session | [user_id, session_id] | ((((domain = 'domain.com') AND (hostname = 'www.domain.com')) AND (event_time >= 1670198400000::bigint)) AND (event_time <= 1672876800000::bigint))]
    └ Eval[min(event_time) AS date, sum(sess_count) FILTER (WHERE (sess_count > 1::bigint)) AS ret_vis, avg(sess_len) FILTER (WHERE (sess_count > 1::bigint)) AS ret_avg, sum(sess_count) FILTER (WHERE (sess_count = 1::bigint)) AS new_vis, sum(sess_count) AS tot_vis, avg(sess_len) AS tot_avg, avg(sess_len) FILTER (WHERE (sess_count = 1::bigint)) AS new_avg]
      └ HashAggregate[min(event_time), sum(sess_count) FILTER (WHERE (sess_count > 1::bigint)), avg(sess_len) FILTER (WHERE (sess_count > 1::bigint)), sum(sess_count) FILTER (WHERE (sess_count = 1::bigint)), sum(sess_count), avg(sess_len), avg(sess_len) FILTER (WHERE (sess_count = 1::bigint))]
        └ Rename[event_time, sess_count, sess_len] AS p_visits
          └ Eval[min(event_time) AS event_time, count(session_id) AS sess_count, max(totaltime) AS sess_len]
            └ GroupHashAggregate[user_id, session_id | min(event_time), count(session_id), max(totaltime)]
              └ Rename[session_id, user_id, event_time, totaltime] AS joined_vis
                └ Eval[session_id AS session_id, user_id AS user_id, event_time AS event_time, totaltime AS totaltime]
                  └ HashJoin[(user_id = user_id)]
                    ├ Collect[af_dev.pageview | [totaltime, event_time, user_id] | ((((event_time >= 1667606400000::bigint) AND (event_time <= 1670198400000::bigint)) AND (host = 'www.domain.com')) AND (domain = 'domain.com'))]
                    └ Collect[af_dev.user_session | [user_id, session_id] | ((((domain = 'domain.com') AND (hostname = 'www.domain.com')) AND (event_time >= 1667606400000::bigint)) AND (event_time <= 1670198400000::bigint))]

Table sizes:
af_dev.pageview 5.7 M records, 1 GB
af_dev.user_session 35,000 records, 16 MB

Crate version: 5.1.2

Hi,
I would try whether something like this performs better:

WITH joined_vis AS (
	SELECT	pageview.totaltime,user_session.user_id,user_session.session_id,pageview.event_time 
	FROM af_dev.pageview
	JOIN af_dev.user_session ON af_dev.pageview.user_id = af_dev.user_session.user_id
	WHERE user_session.domain = 'domain.com' AND user_session.hostname = 'www.domain.com'
		AND user_session.event_time BETWEEN '2022-12-05' AND '2023-01-05'
		AND pageview.event_time BETWEEN '2022-12-05' AND '2023-01-05'
		AND pageview.domain = 'domain.com' AND pageview.host = 'www.domain.com'
),visits AS (
	SELECT max(totaltime) AS sess_len,count(session_id) AS sess_count,min(event_time) AS event_time
	FROM joined_vis
	GROUP BY user_id,session_id
),newvisits AS (
	SELECT sum(sess_count) as new_visnull,avg(sess_len) as new_avgnull
	FROM visits 
	WHERE sess_count = 1
),returningvisits AS (
	SELECT sum(sess_count) as ret_visnull,avg(sess_len) as ret_avgnull
	FROM visits 
	WHERE sess_count > 1)
,statsfromallvisits AS (
	SELECT min(event_time) AS "DATE",sum(sess_count) AS tot_visnull,avg(sess_len) as tot_avgnull
	FROM visits 
)
SELECT "DATE",
	coalesce(new_visnull, 0) AS new_vis,coalesce(new_avgnull, 0) AS new_avg,
	coalesce(ret_visnull, 0) AS ret_vis,coalesce(ret_avgnull, 0) AS ret_avg,
	coalesce(tot_visnull, 0) AS tot_vis,coalesce(tot_avgnull, 0) AS tot_avg
FROM statsfromallvisits,returningvisits,newvisits

If it is still slow and you can share the table definitions (SHOW CREATE TABLE), I will be happy to look into this further.

Thank you.

Hi @robd003

Could you post the output of an EXPLAIN ANALYZE, and do a count(*) on each of the joining selections (i.e. af_dev.pageview and af_dev.user_session)?

Hi again,
I noticed you shared the table definitions in a GitHub Discussion.
Thank you for that. Let’s maybe continue the conversation over here to avoid duplication.
I was particularly interested in whether you had any partitioning strategy defined or custom sharding columns.

In addition to what @proddata requested, could you also share with us the statistics from the following query?

SELECT tablename,attname,n_distinct
FROM pg_stats 
WHERE schemaname='af_dev' 
AND tablename IN ('user_session','pageview') 
AND attname IN ('domain','hostname','event_time','host','user_id','session_id');

Thanks guys! For some reason the post was blacklisted after I posted it here. (It looks like the Crate discussion was just locked.)

Here’s the results from the EXPLAIN ANALYZE on the query: EXPLAIN ANALYZE Very Slow Query - Pastebin.com

Here’s the results from pg_stats:

+--------------+------------+------------+
| tablename    | attname    | n_distinct |
+--------------+------------+------------+
| user_session | hostname   |        7.0 |
| user_session | user_id    |    34118.0 |
| user_session | session_id |    34792.0 |
| user_session | event_time |    34785.0 |
| user_session | domain     |        2.0 |
| pageview     | user_id    |     3534.0 |
| pageview     | session_id |     3659.0 |
| pageview     | host       |        4.0 |
| pageview     | event_time |  5437782.0 |
| pageview     | domain     |        2.0 |
+--------------+------------+------------+
SELECT 10 rows in set (0.001 sec)

Hi, the anti-spam system had flagged the post for some reason, but I marked it as not spam, so it should allow you to post without issues now.

Originally I was using a table partitioned by month, but I thought that might be an issue, so I moved to a non-partitioned table.

Right now I’m testing with the most basic setup: a single node, with both tables clustered into 4 shards on a 4-core system (the same number of shards as cores, as recommended here: Sharding guide — CrateDB: How-Tos). Crate has a 3.5 GB heap and the system has 8 GB of RAM. The dataset is only 1 GB, so I figured this was more than enough.
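For reference, here is a minimal sketch of that table layout. The column list and types are assumptions based on the columns used in the query above; the real definitions come from SHOW CREATE TABLE and may differ.

```sql
-- Hypothetical sketch of the current non-partitioned pageview table,
-- with one shard per core as suggested by the sharding guide.
-- Column names are taken from the query; types are guesses.
CREATE TABLE af_dev.pageview (
    user_id    TEXT,
    event_time TIMESTAMP WITHOUT TIME ZONE,
    domain     TEXT,
    host       TEXT,
    totaltime  BIGINT
) CLUSTERED INTO 4 SHARDS;
```

The earlier monthly-partitioned variant would additionally have declared a partition column via PARTITIONED BY.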

So it looks like using the WITH CTE is helping performance (down to ~6 seconds from 19), but it’s still a long way from being a 1-second query.

Here’s the analyze JSON blob: EXPLAIN ANALYZE WITH cte attempt - Pastebin.com

Current query:

WITH joined_vis AS (
	SELECT pageview.totaltime,user_session.user_id,user_session.session_id,pageview.event_time
	FROM af_dev.pageview
	JOIN af_dev.user_session ON af_dev.pageview.user_id = af_dev.user_session.user_id
	WHERE user_session.domain = 'domain.com' AND user_session.hostname = 'www.domain.com'
		AND user_session.event_time BETWEEN '2022-12-05' AND '2023-01-05'
		AND user_session.device_type IS NOT NULL
		AND pageview.event_time BETWEEN '2022-12-05' AND '2023-01-05'
		AND pageview.domain = 'domain.com' AND pageview.host = 'www.domain.com'
),visits AS (
	SELECT max(totaltime) AS sess_len,count(session_id) AS sess_count,COUNT(DISTINCT user_id) as visitors,min(event_time) AS event_time
	FROM joined_vis
	GROUP BY user_id,session_id
),newvisits AS (
	SELECT sum(visitors) as new_vis,avg(sess_len) as new_avg
	FROM visits 
	WHERE sess_count = 1
),returningvisits AS (
	SELECT sum(visitors) as ret_vis,avg(sess_len) as ret_avg
	FROM visits 
	WHERE sess_count > 1)
,statsfromallvisits AS (
	SELECT min(event_time) AS event_date,sum(visitors) AS tot_vis,avg(sess_len) as tot_avg
	FROM visits 
),
pjoined_vis AS (
	SELECT pageview.totaltime,user_session.user_id,user_session.session_id,pageview.event_time
	FROM af_dev.pageview
	JOIN af_dev.user_session ON af_dev.pageview.user_id = af_dev.user_session.user_id
	WHERE user_session.domain = 'domain.com' AND user_session.hostname = 'www.domain.com'
		AND user_session.event_time BETWEEN '2022-11-05' AND '2022-12-05'
		AND user_session.device_type IS NOT NULL
		AND pageview.event_time BETWEEN '2022-11-05' AND '2022-12-05'
		AND pageview.domain = 'domain.com' AND pageview.host = 'www.domain.com'
),pvisits AS (
	SELECT max(totaltime) AS sess_len,count(session_id) AS sess_count,COUNT(DISTINCT user_id) as visitors,min(event_time) AS event_time
	FROM pjoined_vis
	GROUP BY user_id,session_id
),pnewvisits AS (
	SELECT sum(visitors) as new_vis,avg(sess_len) as new_avg
	FROM pvisits 
	WHERE sess_count = 1
),preturningvisits AS (
	SELECT sum(visitors) as ret_vis,avg(sess_len) as ret_avg
	FROM pvisits 
	WHERE sess_count > 1)
,pstatsfromallvisits AS (
	SELECT min(event_time) AS event_date,sum(visitors) AS tot_vis,avg(sess_len) as tot_avg
	FROM pvisits 
) SELECT event_date, tot_vis, new_vis, ret_vis, tot_avg, new_avg, ret_avg from statsfromallvisits, newvisits, returningvisits
UNION SELECT event_date, tot_vis, new_vis, ret_vis, tot_avg, new_avg, ret_avg from pstatsfromallvisits, pnewvisits, preturningvisits;

Thank you for posting the analysis of the queries and the stats!

Can I please ask you to try replacing BETWEEN with a simpler filter, just >= '2022-12-05' (I know this is not what you want, it is only to compare something), and to paste the resulting EXPLAIN ANALYZE as well?
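To be concrete, the diagnostic variant would keep all the other filters from the original query and only relax the time range to a single lower bound:

```sql
-- Diagnostic only: one-sided range instead of BETWEEN.
-- This matches MORE rows than intended; it is purely to compare plans.
WHERE user_session.domain = 'domain.com'
  AND user_session.hostname = 'www.domain.com'
  AND user_session.event_time >= '2022-12-05'
  AND pageview.event_time >= '2022-12-05'
  AND pageview.domain = 'domain.com'
  AND pageview.host = 'www.domain.com'
```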

I see something strange in the internal queries that get executed, regarding both the domain/host strings and the timestamp range:

 "QueryDescription": "+(+(+event_time:[1667606400000 TO 9223372036854775807] +event_time:[-9223372036854775808 TO 1670198400000]) +host:www.netris.io) +domain:netris.ai",

Hi,
Could you try this version?

WITH sessions AS (
	SELECT user_id,session_id
	FROM af_dev.user_session
	WHERE user_session.domain = 'domain.com' AND user_session.hostname = 'www.domain.com'
		AND user_session.event_time BETWEEN '2022-12-05' AND '2023-01-05'
		AND user_session.device_type IS NOT NULL
),pageviews AS (
	SELECT	pageview.totaltime,pageview.user_id,pageview.event_time 
	FROM af_dev.pageview
	WHERE pageview.event_time BETWEEN '2022-12-05' AND '2023-01-05'
		AND pageview.domain = 'domain.com' AND pageview.host = 'www.domain.com'
),visits AS (
	SELECT max(totaltime) AS sess_len,count(session_id) AS sess_count,COUNT(DISTINCT sessions.user_id) as visitors,min(event_time) AS event_time
	FROM pageviews	
	JOIN sessions ON pageviews.user_id = sessions.user_id	
	GROUP BY pageviews.user_id,sessions.session_id
),psessions AS (
	SELECT user_id,session_id
	FROM af_dev.user_session
	WHERE user_session.domain = 'domain.com' AND user_session.hostname = 'www.domain.com'
		AND user_session.event_time BETWEEN '2022-11-05' AND '2022-12-05'
		AND user_session.device_type IS NOT NULL
), ppageviews AS (
	SELECT	pageview.totaltime,pageview.user_id,pageview.event_time 
	FROM af_dev.pageview
	WHERE pageview.event_time BETWEEN '2022-11-05' AND '2022-12-05'
		AND pageview.domain = 'domain.com' AND pageview.host = 'www.domain.com'
),pvisits AS (
	SELECT max(totaltime) AS sess_len,count(session_id) AS sess_count,COUNT(DISTINCT psessions.user_id) as visitors,min(event_time) AS event_time
	FROM ppageviews	
	JOIN psessions ON ppageviews.user_id = psessions.user_id	
	GROUP BY ppageviews.user_id,psessions.session_id
)
SELECT		min(event_time) as event_date,
			sum(visitors) AS tot_vis,
			sum(visitors) FILTER (WHERE sess_count = 1) AS new_vis,
			sum(visitors) FILTER (WHERE sess_count > 1) AS ret_vis,
			avg(sess_len) AS tot_avg,
			avg(sess_len) FILTER (WHERE sess_count = 1) AS new_avg,			
			avg(sess_len) FILTER (WHERE sess_count > 1) AS ret_avg
FROM visits
UNION 
SELECT		min(event_time) as event_date,
			sum(visitors) AS tot_vis,
			sum(visitors) FILTER (WHERE sess_count = 1) AS new_vis,
			sum(visitors) FILTER (WHERE sess_count > 1) AS ret_vis,
			avg(sess_len) AS tot_avg,
			avg(sess_len) FILTER (WHERE sess_count = 1) AS new_avg,			
			avg(sess_len) FILTER (WHERE sess_count > 1) AS ret_avg
FROM pvisits

No problem, will give this a try now.

I just figured BETWEEN was an alias for >= and <=
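(For anyone reading later: BETWEEN is inclusive on both bounds, so against the tables in this thread the two filters below are semantically equivalent and should select the same rows; the difference being investigated here is in how they are planned and executed, not in their results.)

```sql
-- x BETWEEN a AND b  is defined as  x >= a AND x <= b (both ends inclusive):
SELECT count(*) FROM af_dev.pageview
WHERE event_time BETWEEN '2022-12-05' AND '2023-01-05';

SELECT count(*) FROM af_dev.pageview
WHERE event_time >= '2022-12-05' AND event_time <= '2023-01-05';
-- Both counts should match.
```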

Rather amazing, this is now executing in under 200ms!

Would you like me to share the EXPLAIN ANALYZE plan?

I am glad that worked @robd003
If you do not mind sharing the EXPLAIN ANALYZE, that may give us some insights for improving the optimizer logic in the future.
Thank you.

Here’s the EXPLAIN ANALYZE output: EXPLAIN ANALYZE 400ms Crate query - Pastebin.com

I’d love an option to have Crate automatically report back query usage via the UDC if it would help improve performance.