I recently came across these surprising logs over the course of an audit at one of Dalibo’s client: (the logs are strimmed for better readability and anonymisation).

Jul 26 00:07:51 postgres10[42230]: client=127.0.0.1 LOG:  duration: 406.403 ms  statement: select 1
Jul 26 00:08:02 postgres10[42237]: client=127.0.0.1 LOG:  duration: 780.613 ms  statement: COMMIT

I could come up with an explanation for the slow COMMIT (heavy writes => many WAL buffers to sync on disk, and maybe WAL compression added some latency) but I wouldn’t know what to say to the client about this very long select 1. This query clearly came from PgBouncer: “a simple do-nothing query to check if the server connection is alive”, but that didn’t help much.

Me (embarrassed) and
my client. — Me (embarrassed) and my client.

Let’s have a look at pg_stat_statements:

postgres=# SELECT rolname, datname, query, calls, min_time, max_time, mean_time
  FROM pg_stat_statements s JOIN pg_database d ON (s.dbid = d.oid)
  JOIN pg_roles r ON (r.oid = s.userid) WHERE octet_length(query) < 15
  AND rolname = 'anon' AND datname = 'anon';

 rolname | datname |    query    |    calls    | min_time |  max_time   |      mean_time      
---------+---------+-------------+-------------+----------+-------------+---------------------
 anon    | anon    | COMMIT      |  2078240339 | 0.000247 |   59.077196 | 0.00119367747128697
 anon    | anon    | ROLLBACK    |          81 | 0.000689 |    0.013588 | 0.00160246913580247
 anon    | anon    | BEGIN       |  2121191643 | 0.000251 |   57.558841 | 0.00114355394816151
 anon    | anon    | DISCARD ALL | 17515013300 | 0.004349 | 1442.615547 |  0.0350356836662835
(4 rows)

The select 1 aren’t there, and the longest COMMIT lasts 59ms. So what’s going on? The normalized query select $1 has probably been evicted from pg_stat_statements over the course of a deallocation, although its frequency is rather high. But the parameter pg_stat_statements.max is set to 1000, and this a heavy loaded server with about 30,000 tx/s and 1200 backends during the busiest periods. Now, we still don’t have any explaination about the inconsistency between pg_stat_statements and the logs, regarding the slow COMMITs.

So I started to look at the code (pg_stat_statements.c). Initially, I aimed at understanding better how the query time was computed. But I didn’t go too far, because I read this line of comment:

Note about locking issues: to create or delete an entry in the shared
hashtable, one must hold pgss->lock exclusively

… and then it clicked in my mind: it was probably a locking issue!

Or maybe not.

Anyway, it’s not too difficult to test this hypothesis. Let’s use pgbench with 16 custom scenarii like this one:

BEGIN;
SELECT 1 FROM t1;
SELECT 1 FROM t2;
SELECT 1 FROM t3;
SELECT 1 FROM t4;
SELECT 1 FROM t5;
SELECT 1 FROM t6;
SELECT 1 FROM t7;
SELECT 1 FROM t8;
COMMIT;

The queries are unique, event after normalization.

Here is a script to create the scenarii and the tables:

#!/bin/bash

psql -c "CREATE DATABASE pgbench;"

for x in `seq 1 128`; do
        psql -d pgbench -c "CREATE TABLE t$x (id int);"
done


foo() {
        x=$1
        start=$((1 + x*8))
        stop=$((8 + x*8))

        echo "BEGIN;"
        for x in `seq $start $stop`; do
                echo "SELECT 1 FROM t$x;"
        done
        echo "COMMIT;"
}

for y in `seq 0 15`; do
        f="scenario_$y.sql"
        foo $y > $f
done

These 16 scenarii amounts to 128 unique queries, which we need to reproduce the locking problem, since pg_stat_statements.max lowest possible value is 100.

And here is another one to launch 16 pgbench in parallel, one per scenario:

#!/bin/bash

psql -d pgbench -c "SELECT pg_stat_reset();"
psql -d pgbench -c "SELECT pg_stat_statements_reset();"

for x in `seq 0 15`; do
        pgbench -d pgbench -T 100 -f scenario_$x.sql 2> /dev/null &
        pids[${i}]=$!
done

for pid in ${pids[*]}; do
    wait $pid
done

psql -c "select xact_commit from pg_stat_database where datname = 'pgbench';"
psql -d pgbench -c "select * from pg_stat_statements_info;"

The last line of the above script uses a view that appeared with PostgreSQL 14. Indeed it seems I’m not the only one to have noticed this problem (see this thread on pgsql-hackers), and it’s now possible to know how many deallocations occured since the last reset of pg_stat_statements. A high number for a small period indicates that pg_stat_statements.max is too low.

Let’s launch the pgbench script with pg_stat_statements.max = 400:

 xact_commit 
-------------
      958181
(1 row)

 dealloc |          stats_reset          
---------+-------------------------------
       0 | 2021-10-01 16:35:51.180729+02

So we have an average of 9582 tx/s, and no deallocations.

Let’s launch the pgbench script again, with pg_stat_statements.max = 100:

 xact_commit 
-------------
      714498
(1 row)

 dealloc |          stats_reset          
---------+-------------------------------
  144878 | 2021-10-01 16:33:19.156198+02

That’s 25% less transactions!

Also, in the logs, we observe 4 queries that lasted more than 20ms:

2021-10-04 08:43:53.584 CEST [54262] LOG:  duration: 23.487 ms  statement: SELECT 1 FROM t63;
2021-10-04 08:43:56.335 CEST [54270] LOG:  duration: 21.671 ms  statement: SELECT 1 FROM t21;
2021-10-04 08:44:04.226 CEST [54265] LOG:  duration: 20.455 ms  statement: SELECT 1 FROM t109;
2021-10-04 08:44:28.480 CEST [54271] LOG:  duration: 48.653 ms  statement: COMMIT;

… compared to only one when pg_stat_statements.max is set to 400 (and pgbench lauched with rate limiting, so that the transaction rate matches the previous test).

What is the cause of this performance drop? It is clearly linked to these 144878 deallocations, but is it really the locking problem that I thought of? Let’s use perf together with Flame Graph for a On-CPU analysis of one of the 16 backends:

perf100 perf400

The two flame graphs look similar, except for the leftmost tower of the top one (pg_stat_statements.max = 100). Notice the WaitEventSetWait frame that accounts for 7.4% of all samples.

Note: You can download the svg files (right-click on the image) and run them in your browser to get the full Flame Graph experience.

If it’s really a locking problem, a Off-CPU analysis is more adequate:

off100 offf400

This time, the two graphs are very different!

The tower containing the futex_wait accounts for 28% percent of the off-cpu time, and I think it explains completely the performance drop. I’m unsure how to explain the other big tower (in the middle), but here’s my guess: this is off-cpu time which would otherwise have been caused by involontary context switches, in the absence of locking within pg_stat_statements.

Looking at wait events

PostgresPro provides a very nice extension, pg_wait_sampling, which we can use to understand what is slowing down a given backend :

Let’s launch the pgbench script again, with pg_stat_statements.max = 100, and let’s query the pg_wait_sampling_profile view :

$ PID=33903
$ psql -d pgbench -c "SET pg_wait_sampling.history_size = 100000;
> SELECT pg_wait_sampling_reset_profile();
> SELECT pg_sleep(10);
> SELECT event_type, event, count FROM pg_wait_sampling_profile WHERE pid=$PID"

 event_type |       event        | count
------------+--------------------+-------
 LWLock     | pg_stat_statements |    83
 Client     | ClientRead         |   364

LWLock stands for lightweight lock, and most of them protect a particular data structure in shared memory. In this case, we also get the name of the extension that is waiting for this lock, and with no surprise, it’s pg_stat_statements.

With pg_stat_statements.max = 400, we get :

 event_type |   event    | count
------------+------------+-------
 Client     | ClientRead |    25

No lightweight locks, and much less ClientRead events, which I’m unsure how to explain, but which is consistant with the Off-CPU flame graph (ClientRead events probably match the WaitEventSetWait tower, and LWLock events match the futex_wait tower).

Conclusion

The default value of pg_stat_statements.max is 5000. For some unknown reason, the client decreased this value to 1000. He probably assumed that the performance penalty of having pg_stat_statements loaded would be lesser, but on the contrary it resulted in a much bigger one. The documentation doesn’t say anything about the danger of setting this value too low. Only recently, in version 14, we can rely on the view pg_stat_statements_info, containing the counter deallocand the timestamp stats_reset, to get an idea about whether or not this parameter is set too low. On my 4-cpu machine, the performance drop is noticeable with more than 700 deallocations per second, but to be on the safe side, I would recommend to increase this parameter if there’s more than 10 deallocations per second, also in order to avoid the extra CPU cost of these deallocations. For PostgreSQL versions prior to 14, there’s no simple way of knowing if this parameter is set high enough, but the presence of slow queries that shouldn’t be slow in your logs is a good hint. Anyway, I would recommend sticking with the default value of pg_stat_statements.max, or increasing it, but never decreasing it.