Factorial Growth of Subqueries When Using Nested WITH Clauses in ClickHouse

ClickHouse is a popular OLAP database. It speaks SQL and earns the reputation of "fast and resource efficient". But the support of SQL comes with surprises if not careful. In this blog, I show that a simple query of nested WITH clauses in ClickHouse generates factorial number of subqueries. The simple query is short, reads nothing, process nothing and returns nothing. Yet, it uses a lot of CPU and memory to just parse the query. (Note: The version of ClickHouse used in this post is 23.8.)

Surprise from WITH clause in ClickHouse

At Effodio, we use ClickHouse to process metric data. In a recent load test, we were shocked when ClickHouse server used almost 100% CPU and some SELECT queries timeout after 10 seconds. The table of concern has less than a million rows and the result set has about 100 rows. Why ClickHouse takes 10 seconds? To dive into the issue, we first look at the query_log system table where ClickHouse records information about executed queries. We found two suspicious fields: memory_usage: 1.2 GB and ProfileEvents.SelectQueriesWithSubqueries: 578. On average, each subquery took 2MB memory, which seems OK. But why a simple query turns into 578 subqueries?

The query is constructed programmatically and uses nested WITH clauses for intermediate steps. In common RDBMS like Postgres, the WITH clause is calculated once and works like a temporary table with a lifetime of the query. However, ClickHouse substitutes the code defined in the WITH clause in all places of use. For example, the following query creates 1M rows using WITH and counts the number of rows. In Postgres, it always returns 1M, while in ClickHouse, it returns a different number every time.

 1-- In Postgres, the following query always outputs 1,000,000.
 2WITH cte_numbers AS 
 3  (
 4    SELECT random() AS num
 5    FROM generate_series(1, 1000000)
 6  )
 7SELECT count(*)
 8FROM cte_numbers
 9WHERE num IN (
10  SELECT num
11  FROM cte_numbers
12)
13
14-- In ClickHouse, the following query does not necessarily returns 1,000,000. 
15-- I ran it 3 times and got three outputs: 850,317; 822,638; and 803,773.
16WITH cte_numbers AS
17  (
18    SELECT num
19    FROM generateRandom('num UInt64', NULL)
20    LIMIT 1000000
21  )
22SELECT count()
23FROM cte_numbers
24WHERE num IN (
25  SELECT num
26  FROM cte_numbers
27)

Factorial growth of subqueries

Let's create a no-op query of n nested WITH clauses, an outer WITH queries its inner WITH. For example, with n=3, the query uses three WITH clauses named a, b and c.

 1-- A query of 3 nested WITH clauses. It is easy to see the result set is empty.
 2WITH a AS (
 3  WITH b AS (
 4    WITH c AS (
 5      SELECT 0 AS id
 6    )
 7    SELECT * FROM c
 8    WHERE 1 IN (
 9      SELECT id
10      FROM c
11    )
12  )
13  SELECT * FROM b
14  WHERE 1 IN (
15    SELECT id
16    FROM b
17  )
18)
19SELECT * FROM a
20WHERE 1 IN (
21  SELECT id FROM a
22)

Execute the query on ClickHouse gives us a query id 4f7379a7-1396-41e6-978b-fb91b2b1da60. Then find number of subqueries, latency and memory usage with the following query.

1SELECT
2    query_duration_ms,
3    memory_usage,
4    ProfileEvents.Values[indexOf(ProfileEvents.Names, 'QueriesWithSubqueries')] AS subqueries
5FROM system.query_log
6WHERE type = 'QueryFinish'
7AND query_id = '4f7379a7-1396-41e6-978b-fb91b2b1da60'

Repeat the process for n between 1 and 6. Here are the results.

n query_duration_ms memory_usage subqueries
1 8 9,760 6
2 13 25,752 28
3 32 4,239,229 145
4 156 25,223,105 876
5 1,490 188,844,822 6,139
6 8,468 1,496,512,694 49,120

What is the growth rate of the number of subqueries? Let $N_i$ be the number of subqueries for $i$ WITH clauses. Let's do some math to break down the numbers. $$N_1=6$$ $$ N_2 = 4 \times N_1 + 4 = 4 \times (6 + 1) = 28$$ $$ N_3 = 5 \times N_2 + 5 = 5 \times (28+1) = 145$$ $$ N_4 = 6 \times N_3 + 6 = 6 \times (145+1) = 876$$ $$ N_5 = 7 \times N_4 + 7 = 7 \times (876+1) = 6139$$ $$ N_6 = 8 \times N_5 + 8 = 8 \times (6139+1) = 49120$$

It is clear the number of subqueries grows factorially, $N_i=(2+i)*(N_{i-1} + 1)$.

How dangerous is the query?

When I run the query of n=8, the memory usage goes wild and I have to stop the server. You can take down a ClickHouse server by running just a few queries of n=8.

When I run the query of n=10, ClickHouse cancels the query giving the error AST is too big. Maximum: 500000: (after expansion of aliases). (TOO_BIG_AST).

Conclusion

  1. WITH clauses in ClickHouse is substituted at every reference of it. Thus, nested WITH clauses can generate factorial number of subqueries. This behavior is quite surprising for people coming from classic relational databases. Please exercise caution when using WITH clauses in ClickHouse.
  2. ClickHouse is super fast and has success stories at huge scale like Cloudflare. However, you may not have in-house ClickHouse experts like Cloudflare does. Please write down query patterns and test the patterns on ClickHouse before signing up.
  3. ClickHouse is quite different from other DB engines, by design. Check this Common Getting Started Issues with ClickHouse for more.