PostgreSQL - SELECT DISTINCT
We often need to get distinct/unique results from SQL queries. We use DISTINCT clause to remove duplicate values from a query result. However, at the same time, we can notice that query execution is slow soon after we add DISTINCT clause to a query. This blog post analyzes the performance impact of such queries to see whether we can make the query faster.
Let’s generate some test data to demonstrate this behavior.
-- table creation
create table t_distinct (id int, user_id int, project_id int, date_time timestamp without time zone);
-- generate test data
-- this inserts 10 million records to t_distinct table
insert into t_distinct
select row_number() over(), round(row_number() over()/1000), round(row_number() over()/100000) , date
from generate_series('2019-01-01'::date, '2022-12-31'::date,'10 seconds'::interval
) date
limit 10000000;
-- ensure table's statistics for the query planner are up to date
analyse t_distinct;
Suppose the query we want to tune is the one mentioned below: To check the accuracy of the planner’s estimates, detailed information of each operation, I/O, and various nodes in the plan, options such as ANALYZE and BUFFERS are helpful. Furthermore, I enabled the timing to get the query execution time. (\timing)
Using DISTINCT
explain (analyze, buffers) select distinct user_id
from t_distinct
where project_id = 1
and date_time > '2019-01-01 8:00:00'
and date_time < '2020-12-01 8:00:00';
You can see a DISTINCT clause in SELECT to remove duplicates from user_id column. I just created the table, inserted 10 million records into it, and ran stats update using ANALYZE. I still need to create indexes on the table.
Let’s see the query plan of the preceding query. Before that, below is the table structure.
postgresql13=# \d+ t_distinct
Table "public.t_distinct"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
------------+-----------------------------+-----------+----------+---------+---------+--------------+-------------
id | integer | | | | plain | |
user_id | integer | | | | plain | |
project_id | integer | | | | plain | |
date_time | timestamp without time zone | | | | plain | |
Access method: heap
I ran the query a few times, and below is the query plan.
postgresql13=# explain (analyze, buffers) select distinct user_id
from t_distinct
where project_id = 1
and date_time > '2019-01-01 8:00:00'
and date_time < '2020-12-01 8:00:00';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=144313.41..144407.78 rows=9437 width=4) (actual time=631.417..648.435 rows=100 loops=1)
Group Key: user_id
Batches: 1 Memory Usage: 409kB
Buffers: shared hit=16108 read=47587
-> Gather (cost=1000.00..144149.93 rows=65392 width=4) (actual time=26.133..529.726 rows=100000 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=16108 read=47587
-> Parallel Seq Scan on t_distinct (cost=0.00..136610.73 rows=27247 width=4) (actual time=13.502..420.115 rows=33333 loops=3)
Filter: ((date_time > '2019-01-01 08:00:00'::timestamp without time zone) AND (date_time < '2020-12-01 08:00:00'::timestamp without time zone) AND (project_id = 1))
Rows Removed by Filter: 3300000
Buffers: shared hit=16108 read=47587
Planning Time: 0.119 ms
JIT:
Functions: 17
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 2.491 ms, Inlining 0.000 ms, Optimization 1.426 ms, Emission 23.986 ms, Total 27.904 ms
Execution Time: 649.938 ms
(18 rows)
The query took about ~650ms to execute. Parallel Seq Scan has been used to fetch the data. Seq scan makes sense because this table does not have any indexes. Seq scan operation is parallel because of a PostgreSQL setting, max_parallel_workers. In this scenario, PostgreSQL decided to use two workers (Workers Launched: 2)
Out of the total execution time of 650ms, Seq scan took about ~420ms, which is more than half of the execution time.
The rest of the execution time is for Gather Stream node and HashAggregate node. It means the DISTINCT operation is completed on memory. Memory usage of the query is 409KB, and the work_mem setting in this PostgreSQL instance is 4MB. Adequate memory is available to perform this query. Focusing on the I/O, disk access (47587) and 16108 blocks served directly from memory (shared hit). I have 128MB allocated to the shared_buffers, the default value. The memory is insufficient to keep the entire table’s data in a buffer because the table size is ~500MB.
postgresql13=# select pg_size_pretty(pg_table_size('t_distinct'));
pg_size_pretty
----------------
498 MB
(1 row)
Let’s create the index and execute the same query again. There are two possible indexes I can think of:
-- user_id as included column.
-- index 1
create index t_distinct_idx on t_distinct (project_id, date_time) include (user_id);
-- user id is part of the index key
-- index 2
create index t_distinct_idx2 on t_distinct (project_id, date_time, user_id);
Let’s create the index 1 and execute the query.
postgresql13=# \di+ t_distinct_idx
List of relations
Schema | Name | Type | Owner | Table | Persistence | Size | Description
--------+----------------+-------+----------+------------+-------------+--------+-------------
public | t_distinct_idx | index | postgres | t_distinct | permanent | 387 MB |
(1 row)
Size of the index 1 is 387 MB.
Below is the new query plan. It has indeed changed with the new index I created above.
postgresql13=# explain (analyze, buffers) select distinct user_id
from t_distinct
where project_id = 1
and date_time > '2019-01-01 8:00:00'
and date_time < '2020-12-01 8:00:00';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=2931.36..3025.73 rows=9437 width=4) (actual time=226.507..226.692 rows=100 loops=1)
Group Key: user_id
Batches: 1 Memory Usage: 409kB
Buffers: shared hit=498
-> Index Only Scan using t_distinct_idx on t_distinct (cost=0.56..2767.88 rows=65392 width=4) (actual time=0.040..111.236 rows=100000 loops=1)
Index Cond: ((project_id = 1) AND (date_time > '2019-01-01 08:00:00'::timestamp without time zone) AND (date_time < '2020-12-01 08:00:00'::timestamp without time zone))
Heap Fetches: 0
Buffers: shared hit=498
Planning Time: 0.136 ms
Execution Time: 226.886 ms
(10 rows)
This time query execution time has reduced to ~230ms from ~650ms. It’s a significant improvement. Buffers: shared hit=498, and there are no disk reads. The new index determines the query results, thus less I/O. That is why it is much more efficient than the Seq Scan I got the first time.
Let’s drop the index and create the second index to ensure it makes any difference. This time no INCLUDE clause.
drop index t_distinct_idx;
create index t_distinct_idx2 on t_distinct (project_id, date_time, user_id);
postgresql13=# \di+ t_distinct_idx2
List of relations
Schema | Name | Type | Owner | Table | Persistence | Size | Description
--------+-----------------+-------+----------+------------+-------------+--------+-------------
public | t_distinct_idx2 | index | postgres | t_distinct | permanent | 387 MB |
(1 row)
postgresql13=# explain (analyze, buffers) select distinct user_id
from t_distinct
where project_id = 1
and date_time > '2019-01-01 8:00:00'
and date_time < '2020-12-01 8:00:00';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=2931.36..3025.73 rows=9437 width=4) (actual time=222.884..223.016 rows=100 loops=1)
Group Key: user_id
Batches: 1 Memory Usage: 409kB
Buffers: shared hit=498
-> Index Only Scan using t_distinct_idx2 on t_distinct (cost=0.56..2767.88 rows=65392 width=4) (actual time=0.048..109.369 rows=100000 loops=1)
Index Cond: ((project_id = 1) AND (date_time > '2019-01-01 08:00:00'::timestamp without time zone) AND (date_time < '2020-12-01 08:00:00'::timestamp without time zone))
Heap Fetches: 0
Buffers: shared hit=498
Planning Time: 0.130 ms
Execution Time: 223.161 ms
(10 rows)
The index size is the same, 387MB. The user_id column in INCLUDE or part of the index key does not affect query execution time.
Using GROUP BY
Group By is another way to achieve the same result as DISTINCT. Let’s modify the query to use Group By and check the query plan. It is not a must to use the Aggregate function in GROUP BY. The table structure is the same with one index.
explain (analyze, buffers) select user_id
from t_distinct
where project_id = 1
and date_time > '2019-01-01 8:00:00'
and date_time < '2020-12-01 8:00:00'
group by user_id;
I executed the query a few times, and the avg execution time is ~230ms. No real difference in query execution time, even with GROUP BY in this instance.
postgresql13=# explain (analyze, buffers) select user_id
from t_distinct
where project_id = 1
and date_time > '2019-01-01 8:00:00'
and date_time < '2020-12-01 8:00:00'
group by user_id;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=2306.01..2400.50 rows=9449 width=4) (actual time=222.923..223.053 rows=100 loops=1)
Group Key: user_id
Batches: 1 Memory Usage: 409kB
Buffers: shared hit=498
-> Index Only Scan using t_distinct_idx2 on t_distinct (cost=0.56..2177.46 rows=51418 width=4) (actual time=0.045..109.779 rows=100000 loops=1)
Index Cond: ((project_id = 1) AND (date_time > '2019-01-01 08:00:00'::timestamp without time zone) AND (date_time < '2020-12-01 08:00:00'::timestamp without time zone))
Heap Fetches: 0
Buffers: shared hit=498
Planning Time: 0.175 ms
Execution Time: 223.240 ms
(10 rows)
Every time, I can see the HashAggregate, which is the faster way to get the DISTINCT/GRPUP BY results, as it’s all done in memory.
Why Hash Aggregate
PostgreSQL has a config setting to disable the Hash Aggregate operation. I’m going to disable it and check the query plan. By default, the enable_hashagg attribute is on.
postgresql13=# show enable_hashagg;
enable_hashagg
----------------
on
(1 row)
postgresql13=#
postgresql13=# set enable_hashagg to off;
SET
postgresql13=#
postgresql13=# show enable_hashagg;
enable_hashagg
----------------
off
(1 row)
postgresql13=# explain (analyze, buffers) select distinct user_id
from t_distinct
where project_id = 1
and date_time > '2019-01-01 8:00:00'
and date_time < '2020-12-01 8:00:00';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=6200.92..6458.01 rows=9449 width=4) (actual time=232.287..398.426 rows=100 loops=1)
Buffers: shared hit=498, temp read=172 written=173
-> Sort (cost=6200.92..6329.46 rows=51418 width=4) (actual time=232.284..316.543 rows=100000 loops=1)
Sort Key: user_id
Sort Method: external merge Disk: 1376kB
Buffers: shared hit=498, temp read=172 written=173
-> Index Only Scan using t_distinct_idx2 on t_distinct (cost=0.56..2177.46 rows=51418 width=4) (actual time=0.022..114.175 rows=100000 loops=1)
Index Cond: ((project_id = 1) AND (date_time > '2019-01-01 08:00:00'::timestamp without time zone) AND (date_time < '2020-12-01 08:00:00'::timestamp without time zone))
Heap Fetches: 0
Buffers: shared hit=498
Planning Time: 0.176 ms
Execution Time: 399.191 ms
(12 rows)
Interestingly, PostgreSQL, this time, decided to use the disk (Sort Method: external merge) because we have disabled the Hash Aggregate. Query execution has jumped to ~400ms. Pay attention to the Buffers. It still shows the same shared hit, while additionally, I can see temp read and write operations which will add more I/O pressure to the query.
The same behavior could occur if the system does not have enough work_mem to perform operations such as Sort. But in this case, the reason to use an external disk is that I artificially disabled the Hash Aggregate.
I used the environment below to test all preceding code snippets.
postgresql13=# select version();
version
-----------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 13.9 (Ubuntu 13.9-1.pgdg22.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
(1 row)
Summary
Both DISTINCT and GROUP BY operations are expensive operations for the database engine. It is better to avoid them as much as possible and have good indexing strategies in case you need to use them. Ensure you have appropriate values set for work_mem and shared_buffers attributes, which is crucial to the performance of queries.
About the cover picture: I took this picture at a walking trail during the official visit to Leawood, KS, Sept 2022.