<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://yhuelf.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://yhuelf.github.io/" rel="alternate" type="text/html" /><updated>2026-01-19T09:12:24+00:00</updated><id>https://yhuelf.github.io/feed.xml</id><title type="html">Frédéric Yhuel, PostgreSQL DBA</title><subtitle>This is a blog about PostgreSQL and Linux</subtitle><entry><title type="html">The strange case of the underestimated Merge Join node</title><link href="https://yhuelf.github.io/2026/01/19/under-estimated-mergejoin.html" rel="alternate" type="text/html" title="The strange case of the underestimated Merge Join node" /><published>2026-01-19T07:15:25+00:00</published><updated>2026-01-19T07:15:25+00:00</updated><id>https://yhuelf.github.io/2026/01/19/under-estimated-mergejoin</id><content type="html" xml:base="https://yhuelf.github.io/2026/01/19/under-estimated-mergejoin.html"><![CDATA[<p>This post appeared first on the <a href="https://blog.dalibo.com/2026/01/12/under-estimated-mergejoin.html">Dalibo blog</a>.</p>

<p><em>Brest, France, 19 January 2026</em></p>

<p>We recently encountered a strange optimizer behaviour, reported by one of our customers:</p>

<blockquote>
  <p><strong>Customer</strong>: “Hi Dalibo, we have a query that is very slow on the first execution after a batch process,
and then very fast. We initially suspected a caching effect, but then we noticed that the execution
plan was different.”</p>

  <p><strong>Dalibo</strong>: “That’s a common issue. Autoanalyze didn’t have the opportunity to process the table after the batch
job had finished, and before the first execution of the query. You should run the <code class="language-plaintext highlighter-rouge">VACUUM ANALYZE</code> command
(or at least <code class="language-plaintext highlighter-rouge">ANALYZE</code>) immediately after your batch job.”</p>

  <p><strong>Customer</strong>: “Yes, it actually solves the problem, but… your hypothesis is wrong. We looked at <code class="language-plaintext highlighter-rouge">pg_stat_user_tables</code>,
and are certain that the tables were not vacuumed or analyzed between the slow and
fast executions. We don’t have a production problem, but we would like to understand.”</p>

  <p><strong>Dalibo</strong>: “That’s very surprising! We would also like to understand…”</p>
</blockquote>

<p>So let’s dive in!</p>

<!--MORE-->

<h2 id="execution-plans">Execution plans</h2>

<p>The query is quite basic (table and column names have been anonymized):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span>    <span class="o">*</span>
<span class="k">FROM</span>      <span class="n">bar</span>
<span class="k">LEFT</span> <span class="k">JOIN</span> <span class="n">foo</span> <span class="k">ON</span> <span class="p">(</span><span class="n">bar</span><span class="p">.</span><span class="n">a</span> <span class="o">=</span> <span class="n">foo</span><span class="p">.</span><span class="n">a</span><span class="p">)</span>
<span class="k">WHERE</span>     <span class="n">id</span> <span class="o">=</span> <span class="mi">10744501</span>
<span class="k">ORDER</span> <span class="k">BY</span>  <span class="n">bar</span><span class="p">.</span><span class="n">x</span> <span class="k">DESC</span><span class="p">,</span> <span class="n">foo</span><span class="p">.</span><span class="n">x</span> <span class="k">DESC</span><span class="p">;</span>
</code></pre></div></div>

<p>Here’s <a href="https://explain.dalibo.com/plan/61cee01aa499bad0">the plan</a> of the first execution of the query after the batch job:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Sort (cost=17039.22..17042.11 rows=1156 width=786) (actual time=89056.476..89056.480 rows=6 loops=1)
   Sort Key: bar.x DESC, foo.x DESC
   Sort Method: quicksort Memory: 25kB
   Buffers: shared hit=2255368 read=717581 dirtied=71206 written=11997
   -&gt; Merge Right Join (cost=2385.37..16980.41 rows=1156 width=786) (actual time=89056.428..89056.432 rows=6 loops=1)
         Inner Unique: true
         Merge Cond: (foo.a = bar.a)
         Buffers: shared hit=2255365 read=717581 dirtied=71206 written=11997
         -&gt; Index Scan using foo_fk1 on foo (cost=0.57..145690068.16 rows=80462556 width=734) (actual time=89050.555..89050.557 rows=1 loops=1)
               Buffers: shared hit=2255360 read=717574 dirtied=71206 written=11997
         -&gt; Sort (cost=2384.81..2387.70 rows=1156 width=52) (actual time=5.853..5.854 rows=6 loops=1)
               Sort Key: bar.a
               Sort Method: quicksort Memory: 25kB
               Buffers: shared hit=5 read=7
               -&gt; Index Scan using bar_fk1 on bar (cost=0.58..2326.00 rows=1156 width=52) (actual time=1.514..5.808 rows=6 loops=1)
                     Index Cond: (bar.id = 10744501)
                     Buffers: shared hit=5 read=7
 Settings: effective_cache_size = '20GB', random_page_cost = '2', work_mem = '100MB'
 Planning:
   Buffers: shared hit=11073 read=10738 dirtied=2
 Planning Time: 209.686 ms
 Execution Time: 89056.610 ms
</code></pre></div></div>

<p>And here’s <a href="https://explain.dalibo.com/plan/7d75010b3gb693g7">the plan</a> of the 2nd execution of the query:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Sort (cost=544154.99..544157.54 rows=1020 width=789) (actual time=0.222..0.224 rows=6 loops=1)
   Sort Key: bar.x DESC, foo.x DESC
   Sort Method: quicksort Memory: 25kB
   Buffers: shared hit=39
   -&gt; Nested Loop Left Join (cost=1.15..544104.02 rows=1020 width=789) (actual time=0.087..0.164 rows=6 loops=1)
         Buffers: shared hit=36
         -&gt; Index Scan using bar_fk1 on bar (cost=0.58..2052.42 rows=1020 width=52) (actual time=0.073..0.127 rows=6 loops=1)
               Index Cond: (bar.id = 10744501)
               Buffers: shared hit=12
         -&gt; Index Scan using foo_fk1 on foo (cost=0.57..528.77 rows=265 width=737) (actual time=0.004..0.004 rows=0 loops=6)
               Index Cond: (foo.a = bar.a)
               Buffers: shared hit=24
 Settings: effective_cache_size = '20GB', random_page_cost = '2', work_mem = '100MB'
 Planning:
   Buffers: shared hit=329
 Planning Time: 10.390 ms
 Execution Time: 0.373 ms
</code></pre></div></div>

<h2 id="searching-for-the-culprit">Searching for the culprit</h2>

<p>The main difference between the two plans is the join strategy. It’s a
<a href="https://postgrespro.com/blog/pgsql/5969770">Merge Join</a> in the first one, and a <strong>Nested Loop Join</strong> in the second one.</p>

<p>Since the two tables haven’t been analyzed or vacuumed, we know the statistics<sup id="fnref:stats" role="doc-noteref"><a href="#fn:stats" class="footnote" rel="footnote">1</a></sup> didn’t change between the
two executions. We observe that the first execution modifies a lot of buffers (<code class="language-plaintext highlighter-rouge">dirtied=71206</code>), and that’s our first clue.</p>
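
<p>For reference, a check along these lines can confirm that no maintenance ran between two executions. It is only a sketch: <code class="language-plaintext highlighter-rouge">foo</code> and <code class="language-plaintext highlighter-rouge">bar</code> are the anonymized table names used above, and the idea is to run it before and after the slow execution and compare the timestamps.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Timestamps of the last manual and automatic VACUUM / ANALYZE runs.
-- If they are identical before and after the slow execution,
-- the statistics were not refreshed in between.
SELECT relname,
       last_vacuum,  last_autovacuum,
       last_analyze, last_autoanalyze,
       n_dead_tup
FROM   pg_stat_user_tables
WHERE  relname IN ('foo', 'bar');
</code></pre></div></div>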

<p>I was surprised by the
<a href="https://explain.dalibo.com/plan/61cee01aa499bad0#plan/node/3">high total cost of the outer index scan</a>
in the
<code class="language-plaintext highlighter-rouge">Merge Join</code> node (145 million), especially since the cost of the <code class="language-plaintext highlighter-rouge">Merge Join</code> itself is only 16,980.
In fact, the planner knows that only a small fraction
of one of the two <code class="language-plaintext highlighter-rouge">Merge Join</code> child nodes will be executed.
This occurs, for example, when the histograms of the join clause columns (<code class="language-plaintext highlighter-rouge">foo.a</code> and <code class="language-plaintext highlighter-rouge">bar.a</code>, in our case) have only
a small overlap, or no overlap
(I would like to thank Robert Haas for teaching me this when I asked about it on the
<a href="https://discord.gg/bx2G9KWyrY">PostgreSQL Hacking Discord</a>).</p>

<p>In our case, the histograms are as follows:</p>

<ul>
  <li>histogram of column <strong>foo.a</strong>: <code class="language-plaintext highlighter-rouge">{1532523860,1532923673, &lt;97 more values&gt;, 1573407772,1573803559}</code></li>
  <li>histogram of column <strong>bar.a</strong>: <code class="language-plaintext highlighter-rouge">{16877,15720140, &lt;97 more values&gt;, 1485178901,1499389426}</code></li>
</ul>

<p>So they actually don’t overlap at all.</p>
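
<p>A query like the following can extract the two ends of each histogram. This is a sketch rather than the exact query we ran: the cast to <code class="language-plaintext highlighter-rouge">bigint[]</code> assumes integer join columns, and the alias names are made up for the example.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- histogram_bounds has the pseudo-type anyarray, so it is cast through text
-- before its elements can be indexed.
SELECT tablename, attname,
       bounds[1]                      AS lowest_bound,
       bounds[array_upper(bounds, 1)] AS highest_bound
FROM  (SELECT tablename, attname,
              histogram_bounds::text::bigint[] AS bounds
       FROM   pg_stats
       WHERE  tablename IN ('foo', 'bar')
       AND    attname = 'a') AS s;
</code></pre></div></div>

<p>The histograms don’t overlap when the lowest bound of one column is greater than the highest bound of the other, as is the case here (1,532,523,860 for <code class="language-plaintext highlighter-rouge">foo.a</code> versus 1,499,389,426 for <code class="language-plaintext highlighter-rouge">bar.a</code>).</p>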

<p>When I saw this, I suddenly remembered a case from a few years ago when we had to deal
with a query whose planning time was absurdly high on the first execution. There were many index tuples
pointing to dead heap tuples.</p>

<p>When the optimizer computes the selectivity of one histogram,
it may probe the index to determine the actual extreme values.
(The function is <code class="language-plaintext highlighter-rouge">get_actual_variable_endpoint()</code> <sup id="fnref:get_actual_variable_endpoint" role="doc-noteref"><a href="#fn:get_actual_variable_endpoint" class="footnote" rel="footnote">2</a></sup>,
called by <code class="language-plaintext highlighter-rouge">get_actual_variable_range()</code>
when the first or last histogram entry is accessed
by the function <code class="language-plaintext highlighter-rouge">ineq_histogram_selectivity()</code>).</p>

<p>At that time, this function could end up
reading many heap pages, and also writing index pages in order to set the
<a href="https://www.cybertec-postgresql.com/en/killed-index-tuples/">dead flag</a>
for the index tuples pointing to dead heap tuples. This explained the very high planning time.</p>

<p>However, in November 2022, a
<a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9c6ad5eaa957bdc2132b900a96e0d2ec9264d39c">patch</a>
from Simon Riggs limited the number of heap pages <code class="language-plaintext highlighter-rouge">get_actual_variable_endpoint()</code> could fetch.
After reading 100 heap pages, it gives up, preventing the planning time explosion.</p>

<p>As a consequence, we have this strange case where the first execution of a query is planned
differently from the second one.
The first time, <code class="language-plaintext highlighter-rouge">get_actual_variable_endpoint()</code> gives up, so the planner uses
the extreme value recorded in the histogram.
The second time, if enough of the index tuples pointing to dead heap
tuples were <em>killed</em>, <code class="language-plaintext highlighter-rouge">get_actual_variable_endpoint()</code> successfully returns
the <em>actual</em> extreme value.
Thus, the planner works with more accurate information.</p>

<h2 id="verifying-the-hypothesis">Verifying the hypothesis</h2>

<h3 id="1st-execution">1st execution</h3>

<p>Assuming our hypothesis is correct, on the first execution, <code class="language-plaintext highlighter-rouge">get_actual_variable_endpoint()</code> gives up. Since our
histograms don’t overlap at all, the planner could estimate a zero total cost for the <code class="language-plaintext highlighter-rouge">Merge Join</code> node. However,
the planner is wary of extremely low or large selectivity estimates<sup id="fnref:default_cutoff" role="doc-noteref"><a href="#fn:default_cutoff" class="footnote" rel="footnote">3</a></sup>,
so it estimates (by simplifying) that it will run at least a small fraction
(0.01 %, or one-hundredth of the histogram resolution)
of the outer child node, and, at worst, almost all (99.99 %) of the inner child node to find the first matching tuple, and
all of it to find them all. So the startup cost of the <code class="language-plaintext highlighter-rouge">Merge Join</code>
node is close to the total cost of the inner child node. Its run cost (total cost minus startup cost)
should be close to <code class="language-plaintext highlighter-rouge">0.0001</code> of the total cost of the outer child node.</p>

<p>Let’s do the math. In the above plan, the <code class="language-plaintext highlighter-rouge">Merge Join</code> startup cost is 2385, which is slightly higher than the
total cost of the inner index scan multiplied by 0.9999 (2326). Therefore, it’s consistent.
The run cost equals 16,980 - 2,385 = 14,595, which is slightly higher than the total cost of the outer index
scan multiplied by 0.0001 (145,690,068 × 0.0001 = 14,569). Once again, this is consistent with our hypothesis.</p>
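
<p>The same arithmetic can be replayed in <code class="language-plaintext highlighter-rouge">psql</code>, using the unrounded costs from the slow plan. The 0.9999 and 0.0001 factors are the simplified fractions discussed above, not values read from the planner, and the column aliases are only illustrative.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT round(2326.00 * 0.9999, 2)      AS inner_scan_share,    -- compare with the startup cost, 2385.37
       16980.41 - 2385.37              AS merge_join_run_cost, -- total cost minus startup cost
       round(145690068.16 * 0.0001, 2) AS outer_scan_share;    -- compare with the run cost
</code></pre></div></div>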

<h3 id="2nd-execution">2nd execution</h3>

<p>Assuming our hypothesis is correct, on the second execution, <code class="language-plaintext highlighter-rouge">get_actual_variable_endpoint()</code> successfully returns the
actual extreme values, thanks to the already killed tuples in the index.
And we expect the total cost of the <code class="language-plaintext highlighter-rouge">Merge Join</code> to
be higher than the total cost of the <code class="language-plaintext highlighter-rouge">Nested Loop Join</code> node in the above “fast” plan, which is 544,104.</p>

<p>We can’t say anything
very accurate here, because we didn’t ask our customer for the real minimum value of <code class="language-plaintext highlighter-rouge">foo.a</code>
and the real maximum value of <code class="language-plaintext highlighter-rouge">bar.a</code>.
Besides, it would be very tedious to unfold the computations
performed by the planner. Suffice it to say that if PostgreSQL estimates that it will have to run at least
0.5 % of the outer index scan, then the total cost of the <code class="language-plaintext highlighter-rouge">Merge Join</code> node exceeds
0.005 × 145,690,068 = 728,450, which is greater than the total cost of the <code class="language-plaintext highlighter-rouge">Nested Loop</code> node.
So the planner chooses the <code class="language-plaintext highlighter-rouge">Nested Loop</code>.</p>
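
<p>As a rough sanity check, we can also compute the fraction of the outer index scan at which the two plans would cost about the same. This deliberately ignores the startup cost of the <code class="language-plaintext highlighter-rouge">Merge Join</code>, so it is only an order of magnitude, not what the planner actually computes.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT round(0.005 * 145690068.16, 2)             AS cost_at_half_percent, -- compare with 544104.02
       round(544104.02 / 145690068.16 * 100, 3)   AS break_even_percent;   -- fraction of the outer scan, in %
</code></pre></div></div>

<p>With these numbers, the break-even point is a little under 0.4 % of the outer index scan.</p>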

<h2 id="a-script-to-reproduce-the-case">A script to reproduce the case</h2>

<p>The following script reproduces the case. First, we create two tables and disable autovacuum for them.</p>

<p>We insert one million rows into <code class="language-plaintext highlighter-rouge">foo</code>, with values for column <code class="language-plaintext highlighter-rouge">a</code> between 200,000 and 299,999.</p>

<p>We insert one million rows into <code class="language-plaintext highlighter-rouge">bar</code>, with values for column <code class="language-plaintext highlighter-rouge">a</code> between 100,000 and 199,999
(without any overlap with <code class="language-plaintext highlighter-rouge">foo.a</code>).</p>

<p>We run <code class="language-plaintext highlighter-rouge">VACUUM ANALYZE</code> on both tables, and then insert 100,000 rows in <code class="language-plaintext highlighter-rouge">foo</code>,
with values for column <code class="language-plaintext highlighter-rouge">a</code> between 185,000 and 195,000 (inclusive),
so there is now some overlap. The autovacuum is inhibited here,
and we can verify that the histograms don’t overlap in the statistics view. We create the indexes.</p>

<p>Then we delete all the rows that were inserted by the last <code class="language-plaintext highlighter-rouge">INSERT</code>,
except the last value (<code class="language-plaintext highlighter-rouge">a</code> = 195,000).
So the minimum value for <code class="language-plaintext highlighter-rouge">foo.a</code> is now 195,000, but the beginning of the index still contains many
entries pointing to dead heap tuples, and <code class="language-plaintext highlighter-rouge">get_actual_variable_endpoint()</code> will abort and return <code class="language-plaintext highlighter-rouge">false</code>
on the first execution.</p>

<p>When you run the script, the last command should output a plan
with an underestimated <code class="language-plaintext highlighter-rouge">Merge Join</code> node.
You’ll see <em>Heap Fetches</em>, hinting that the entries pointing to dead tuples are being cleaned up in the index.</p>

<p>Running this EXPLAIN command a second time
will result in a <code class="language-plaintext highlighter-rouge">Nested Loop Join</code>.
With <code class="language-plaintext highlighter-rouge">SET enable_nestloop TO off</code>, you can
verify that the cost of the <code class="language-plaintext highlighter-rouge">Merge Join</code> is much higher.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">DROP</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">EXISTS</span> <span class="n">foo</span><span class="p">,</span> <span class="n">bar</span><span class="p">;</span>

<span class="k">CREATE</span> <span class="n">UNLOGGED</span> <span class="k">TABLE</span> <span class="n">foo</span><span class="p">(</span><span class="n">id</span> <span class="nb">int</span><span class="p">,</span> <span class="n">a</span> <span class="nb">int</span><span class="p">);</span>
<span class="k">CREATE</span> <span class="n">UNLOGGED</span> <span class="k">TABLE</span> <span class="n">bar</span><span class="p">(</span><span class="n">id</span> <span class="nb">int</span><span class="p">,</span> <span class="n">a</span> <span class="nb">int</span><span class="p">);</span>

<span class="k">ALTER</span> <span class="k">TABLE</span> <span class="n">foo</span> <span class="k">SET</span> <span class="p">(</span><span class="n">autovacuum_enabled</span> <span class="o">=</span> <span class="k">off</span><span class="p">);</span>
<span class="k">ALTER</span> <span class="k">table</span> <span class="n">bar</span> <span class="k">SET</span> <span class="p">(</span><span class="n">autovacuum_enabled</span> <span class="o">=</span> <span class="k">off</span><span class="p">);</span>

<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">foo</span> <span class="k">SELECT</span> <span class="n">i</span><span class="p">,</span>  <span class="mi">200000</span> <span class="o">+</span> <span class="n">random</span><span class="p">()</span> <span class="o">*</span> <span class="mi">100000</span> <span class="k">FROM</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">999999</span><span class="p">)</span> <span class="n">i</span><span class="p">;</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">bar</span> <span class="k">SELECT</span> <span class="n">i</span><span class="o">%</span><span class="mi">100000</span><span class="p">,</span> <span class="mi">100000</span> <span class="o">+</span> <span class="n">i</span><span class="o">/</span><span class="mi">10</span> <span class="k">FROM</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">999999</span><span class="p">)</span> <span class="n">i</span><span class="p">;</span>

<span class="k">VACUUM</span> <span class="k">ANALYZE</span> <span class="n">foo</span><span class="p">,</span> <span class="n">bar</span><span class="p">;</span>

<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">foo</span> <span class="k">SELECT</span> <span class="mi">1000000</span> <span class="o">+</span> <span class="n">i</span><span class="p">,</span> <span class="mi">185000</span> <span class="o">+</span> <span class="n">i</span><span class="o">/</span><span class="mi">10</span> <span class="k">FROM</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">100000</span><span class="p">)</span> <span class="n">i</span><span class="p">;</span>

<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">pg_stats</span> <span class="k">WHERE</span> <span class="n">tablename</span> <span class="k">IN</span> <span class="p">(</span><span class="s1">'foo'</span><span class="p">,</span> <span class="s1">'bar'</span><span class="p">)</span> <span class="k">AND</span> <span class="n">ATTNAME</span> <span class="o">=</span> <span class="s1">'a'</span><span class="p">;</span>

<span class="k">CREATE</span> <span class="k">INDEX</span> <span class="k">ON</span> <span class="n">foo</span><span class="p">(</span><span class="n">a</span><span class="p">);</span>
<span class="k">CREATE</span> <span class="k">INDEX</span> <span class="k">ON</span> <span class="n">bar</span><span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">a</span><span class="p">);</span>

<span class="k">DELETE</span> <span class="k">FROM</span> <span class="n">foo</span> <span class="k">WHERE</span> <span class="n">a</span> <span class="o">&gt;=</span> <span class="mi">185000</span> <span class="k">and</span> <span class="n">a</span> <span class="o">&lt;</span> <span class="mi">195000</span><span class="p">;</span>

<span class="k">EXPLAIN</span> <span class="p">(</span><span class="k">ANALYZE</span><span class="p">,</span> <span class="n">BUFFERS</span><span class="p">,</span> <span class="n">SETTINGS</span><span class="p">)</span>
<span class="k">SELECT</span> <span class="n">foo</span><span class="p">.</span><span class="n">a</span> <span class="k">FROM</span> <span class="n">foo</span> <span class="k">JOIN</span> <span class="n">bar</span> <span class="k">USING</span> <span class="p">(</span><span class="n">a</span><span class="p">)</span> <span class="k">WHERE</span> <span class="n">bar</span><span class="p">.</span><span class="n">id</span> <span class="o">=</span> <span class="mi">2047</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="n">a</span><span class="p">;</span>
</code></pre></div></div>
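
<p>The <code class="language-plaintext highlighter-rouge">enable_nestloop</code> check can be sketched as follows, to be run after the first <code class="language-plaintext highlighter-rouge">EXPLAIN</code> above. The transaction is only there to keep the setting local. Depending on your version and settings, the planner may fall back to another join strategy, in which case disabling it as well (an assumption on my part, not something the script above requires) should bring back the <code class="language-plaintext highlighter-rouge">Merge Join</code>.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BEGIN;
-- Discourage the Nested Loop so that the Merge Join cost becomes visible.
SET LOCAL enable_nestloop TO off;
-- If a Hash Join shows up instead, uncomment the next line:
-- SET LOCAL enable_hashjoin TO off;
EXPLAIN
SELECT foo.a FROM foo JOIN bar USING (a) WHERE bar.id = 2047 ORDER BY a;
ROLLBACK;
</code></pre></div></div>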

<h2 id="conclusion">Conclusion</h2>

<p>We found a case in which the query plan can change between two executions, while the data and statistics
remain exactly the same. As far as I know, there are no other known cases<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">4</a></sup>, but if you know of one, or
have any questions or feedback about this article, please contact Frédéric Yhuel at
<a href="mailto:frederic.yhuel@dalibo.com">frederic.yhuel@dalibo.com</a>.</p>

<p>One more thing: typically, estimation errors in the planner result in an
output row count that differs greatly from the actual count, for a given node. This kind of
discrepancy is easy to spot, and graphical tools like <a href="https://explain.dalibo.com">explain.dalibo.com</a> and <a href="https://explain.depesz.com">explain.depesz.com</a> highlight it.
A common example is when the planner selects a <code class="language-plaintext highlighter-rouge">Nested Loop Join</code> instead of a <code class="language-plaintext highlighter-rouge">Hash Join</code> because the
number of output rows of the outer child node is significantly underestimated. The reverse also happens, albeit
less frequently. However, we have encountered another type of estimation error that is much more difficult to spot,
because it comes from the direct use of indexes by the planner.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:stats" role="doc-endnote">
      <p>Not only those contained in <code class="language-plaintext highlighter-rouge">pg_statistic</code>, but also those in <code class="language-plaintext highlighter-rouge">pg_class</code>.</p>

      <p>The latter can also be updated by an index creation or a reindex, but this was not the case here. <a href="#fnref:stats" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:get_actual_variable_endpoint" role="doc-endnote">
      <p>See <a href="https://github.com/postgres/postgres/blob/6c99c715ddb338e169c2ffd2a4cf754fa510cccb/src/backend/utils/adt/selfuncs.c#L6753">selfuncs.c</a> <a href="#fnref:get_actual_variable_endpoint" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:default_cutoff" role="doc-endnote">
      <p>See <a href="https://github.com/postgres/postgres/blob/REL_18_STABLE/src/backend/utils/adt/selfuncs.c#L996">selfuncs.c</a> <a href="#fnref:default_cutoff" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:1" role="doc-endnote">
      <p>I can think of one other case, involving GEQO and a different GEQO seed, but it’s not very interesting to me. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="optimizer,planner" /><summary type="html"><![CDATA[This post appeared first on the Dalibo blog.]]></summary></entry><entry><title type="html">What are these slow COMMIT in my PostgreSQL logs?</title><link href="https://yhuelf.github.io/2021/09/30/pg_stat_statements_bottleneck.html" rel="alternate" type="text/html" title="What are these slow COMMIT in my PostgreSQL logs?" /><published>2021-09-30T10:15:25+00:00</published><updated>2021-09-30T10:15:25+00:00</updated><id>https://yhuelf.github.io/2021/09/30/pg_stat_statements_bottleneck</id><content type="html" xml:base="https://yhuelf.github.io/2021/09/30/pg_stat_statements_bottleneck.html"><![CDATA[<p>I recently came across these surprising logs over the course of an audit at one
of <a href="https://dalibo.com/">Dalibo</a>’s clients
(the logs are trimmed for better readability and anonymisation):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Jul 26 00:07:51 postgres10[42230]: client=127.0.0.1 LOG:  duration: 406.403 ms  statement: select 1
Jul 26 00:08:02 postgres10[42237]: client=127.0.0.1 LOG:  duration: 780.613 ms  statement: COMMIT
</code></pre></div></div>

<p>I could come up with an explanation for the slow COMMIT (heavy writes =&gt; many
WAL buffers to sync on disk, and maybe WAL compression added some latency) but
I wouldn’t know what to say to the client about this very
long <code class="language-plaintext highlighter-rouge">select 1</code>. This query clearly came from PgBouncer: “a simple do-nothing
query to check if the server connection is alive”, but that didn’t help much.</p>

<figure class="image">
  <img src="/img/babar.jpg" alt="Me (embarrassed) and
my client." />
  <figcaption>Me (embarrassed) and
my client.</figcaption>
</figure>

<p>Let’s have a look at <code class="language-plaintext highlighter-rouge">pg_stat_statements</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>postgres=# SELECT rolname, datname, query, calls, min_time, max_time, mean_time
  FROM pg_stat_statements s JOIN pg_database d ON (s.dbid = d.oid)
  JOIN pg_roles r ON (r.oid = s.userid) WHERE octet_length(query) &lt; 15
  AND rolname = 'anon' AND datname = 'anon';

 rolname | datname |    query    |    calls    | min_time |  max_time   |      mean_time      
---------+---------+-------------+-------------+----------+-------------+---------------------
 anon    | anon    | COMMIT      |  2078240339 | 0.000247 |   59.077196 | 0.00119367747128697
 anon    | anon    | ROLLBACK    |          81 | 0.000689 |    0.013588 | 0.00160246913580247
 anon    | anon    | BEGIN       |  2121191643 | 0.000251 |   57.558841 | 0.00114355394816151
 anon    | anon    | DISCARD ALL | 17515013300 | 0.004349 | 1442.615547 |  0.0350356836662835
(4 rows)
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">select 1</code> queries aren’t there, and the longest <code class="language-plaintext highlighter-rouge">COMMIT</code> lasts 59 ms. So what’s
going on? The normalized query <code class="language-plaintext highlighter-rouge">select $1</code> has probably been evicted from
pg_stat_statements over the course of a deallocation, although its frequency
is rather high. But the parameter <code class="language-plaintext highlighter-rouge">pg_stat_statements.max</code> is set to <code class="language-plaintext highlighter-rouge">1000</code>, and
this is a heavily loaded server with about 30,000 tx/s and 1200 backends during the
busiest periods. Now, we still don’t have any explanation for the
inconsistency between pg_stat_statements and the logs, regarding the slow
COMMITs.</p>
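
<p>One way to get a feel for how saturated the hash table is: compare the number of tracked entries with the configured maximum. This is only a sketch, and a count that stays close to the maximum merely suggests that evictions are likely.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Configured size of the pg_stat_statements hash table.
SHOW pg_stat_statements.max;

-- Number of entries currently tracked, across all databases and roles.
SELECT count(*) AS tracked_statements FROM pg_stat_statements;
</code></pre></div></div>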

<p>So I started to look at the code (pg_stat_statements.c). Initially, I aimed at
understanding better how the query time was computed. But I didn’t go too far,
because I read this line of comment:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Note about locking issues: to create or delete an entry in the shared
hashtable, one must hold pgss-&gt;lock exclusively
</code></pre></div></div>

<p>… and then it clicked in my mind: it was probably a locking issue!</p>

<p>Or maybe not.</p>

<p>Anyway, it’s not too difficult to test this hypothesis. Let’s use pgbench with
16 custom scenarios like this one:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BEGIN;
SELECT 1 FROM t1;
SELECT 1 FROM t2;
SELECT 1 FROM t3;
SELECT 1 FROM t4;
SELECT 1 FROM t5;
SELECT 1 FROM t6;
SELECT 1 FROM t7;
SELECT 1 FROM t8;
COMMIT;
</code></pre></div></div>

<p>The queries are unique, even after normalization.</p>

<p>Here is a script to create the scenarios and the tables:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>

psql <span class="nt">-c</span> <span class="s2">"CREATE DATABASE pgbench;"</span>

<span class="k">for </span>x <span class="k">in</span> <span class="sb">`</span><span class="nb">seq </span>1 128<span class="sb">`</span><span class="p">;</span> <span class="k">do
        </span>psql <span class="nt">-d</span> pgbench <span class="nt">-c</span> <span class="s2">"CREATE TABLE t</span><span class="nv">$x</span><span class="s2"> (id int);"</span>
<span class="k">done


</span>foo<span class="o">()</span> <span class="o">{</span>
        <span class="nv">x</span><span class="o">=</span><span class="nv">$1</span>
        <span class="nv">start</span><span class="o">=</span><span class="k">$((</span><span class="m">1</span> <span class="o">+</span> x<span class="o">*</span><span class="m">8</span><span class="k">))</span>
        <span class="nv">stop</span><span class="o">=</span><span class="k">$((</span><span class="m">8</span> <span class="o">+</span> x<span class="o">*</span><span class="m">8</span><span class="k">))</span>

        <span class="nb">echo</span> <span class="s2">"BEGIN;"</span>
        <span class="k">for </span>x <span class="k">in</span> <span class="sb">`</span><span class="nb">seq</span> <span class="nv">$start</span> <span class="nv">$stop</span><span class="sb">`</span><span class="p">;</span> <span class="k">do
                </span><span class="nb">echo</span> <span class="s2">"SELECT 1 FROM t</span><span class="nv">$x</span><span class="s2">;"</span>
        <span class="k">done
        </span><span class="nb">echo</span> <span class="s2">"COMMIT;"</span>
<span class="o">}</span>

<span class="k">for </span>y <span class="k">in</span> <span class="sb">`</span><span class="nb">seq </span>0 15<span class="sb">`</span><span class="p">;</span> <span class="k">do
        </span><span class="nv">f</span><span class="o">=</span><span class="s2">"scenario_</span><span class="nv">$y</span><span class="s2">.sql"</span>
        foo <span class="nv">$y</span> <span class="o">&gt;</span> <span class="nv">$f</span>
<span class="k">done</span>
</code></pre></div></div>

<p>These 16 scenarios amount to 128 unique queries. We need more than 100 distinct
queries to reproduce the locking problem, because the lowest possible value of
<code class="language-plaintext highlighter-rouge">pg_stat_statements.max</code> is <code class="language-plaintext highlighter-rouge">100</code>.</p>
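
<p>As a quick sanity check of that arithmetic, here is a minimal sketch that recomputes the table range of each scenario with the same formula as the <code class="language-plaintext highlighter-rouge">foo</code> function above. The helper names <code class="language-plaintext highlighter-rouge">scenario_start</code> and <code class="language-plaintext highlighter-rouge">scenario_stop</code> are made up for this illustration and are not part of the original scripts.</p>

```shell
#!/bin/sh

# Hypothetical helpers (not in the original scripts): first and last table
# number used by scenario $1, computed exactly like start/stop in foo().
scenario_start() { echo $((1 + $1 * 8)); }
scenario_stop()  { echo $((8 + $1 * 8)); }

# Add up the number of tables covered by scenarios 0 to 15.
total=0
y=0
while [ "$y" -le 15 ]; do
        total=$((total + $(scenario_stop "$y") - $(scenario_start "$y") + 1))
        y=$((y + 1))
done

echo "scenario 0 reads t$(scenario_start 0) to t$(scenario_stop 0)"
echo "scenario 15 reads t$(scenario_start 15) to t$(scenario_stop 15)"
echo "tables covered by the 16 scenarios: $total"
```

<p>Each scenario covers 8 consecutive tables, so the 16 files together touch t1 to t128 exactly once, which is comfortably above the minimum of 100 entries.</p>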

<p>And here is another script, which launches 16 pgbench instances in parallel, one per scenario:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>

psql <span class="nt">-d</span> pgbench <span class="nt">-c</span> <span class="s2">"SELECT pg_stat_reset();"</span>
psql <span class="nt">-d</span> pgbench <span class="nt">-c</span> <span class="s2">"SELECT pg_stat_statements_reset();"</span>

<span class="k">for </span>x <span class="k">in</span> <span class="sb">`</span><span class="nb">seq </span>0 15<span class="sb">`</span><span class="p">;</span> <span class="k">do
        </span>pgbench <span class="nt">-d</span> pgbench <span class="nt">-T</span> 100 <span class="nt">-f</span> scenario_<span class="nv">$x</span>.sql 2&gt; /dev/null &amp;
        pids[<span class="k">${</span><span class="nv">x</span><span class="k">}</span><span class="o">]=</span><span class="nv">$!</span>
<span class="k">done

for </span>pid <span class="k">in</span> <span class="k">${</span><span class="nv">pids</span><span class="p">[*]</span><span class="k">}</span><span class="p">;</span> <span class="k">do
    </span><span class="nb">wait</span> <span class="nv">$pid</span>
<span class="k">done

</span>psql <span class="nt">-c</span> <span class="s2">"select xact_commit from pg_stat_database where datname = 'pgbench';"</span>
psql <span class="nt">-d</span> pgbench <span class="nt">-c</span> <span class="s2">"select * from pg_stat_statements_info;"</span>
</code></pre></div></div>

<p>The last line of the above script uses a view that appeared with PostgreSQL 14.
Indeed, it seems I’m not the only one to have noticed this problem
<a href="https://www.postgresql.org/message-id/0d9f1107772cf5c3f954e985464c7298%40oss.nttdata.com">(see this thread on pgsql-hackers)</a>,
and it’s now possible to know how many deallocations occurred since the last reset
of pg_stat_statements. A high number over a short period indicates that
<code class="language-plaintext highlighter-rouge">pg_stat_statements.max</code> is too low.</p>

<p>Let’s launch the pgbench script with <code class="language-plaintext highlighter-rouge">pg_stat_statements.max = 400</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> xact_commit 
-------------
      958181
(1 row)

 dealloc |          stats_reset          
---------+-------------------------------
       0 | 2021-10-01 16:35:51.180729+02
</code></pre></div></div>

<p>So we have an average of 9582 tx/s, and no deallocations.</p>

<p>Let’s launch the pgbench script again, with <code class="language-plaintext highlighter-rouge">pg_stat_statements.max = 100</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> xact_commit 
-------------
      714498
(1 row)

 dealloc |          stats_reset          
---------+-------------------------------
  144878 | 2021-10-01 16:33:19.156198+02
</code></pre></div></div>

<p>That’s 25% fewer transactions!</p>
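
<p>The figures quoted for the two runs can be recomputed with a few lines of shell. This is only a sketch: it assumes each run lasted the 100 seconds requested with <code class="language-plaintext highlighter-rouge">-T 100</code>, and the helper names <code class="language-plaintext highlighter-rouge">rate</code> and <code class="language-plaintext highlighter-rouge">drop_percent</code> are invented for the example.</p>

```shell
#!/bin/sh

# Figures copied from the two runs above; the duration is an assumption
# based on the -T 100 option passed to pgbench.
duration=100
commits_max_400=958181
commits_max_100=714498
dealloc_max_100=144878

# Hypothetical helper: events per second, rounded to the nearest integer.
rate() {
        LC_ALL=C awk -v n="$1" -v s="$2" 'BEGIN { printf "%.0f\n", n / s }'
}

# Hypothetical helper: relative drop from $1 to $2, in percent.
drop_percent() {
        LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) * 100 / a }'
}

echo "max = 400: $(rate "$commits_max_400" "$duration") tx/s"
echo "max = 100: $(rate "$commits_max_100" "$duration") tx/s"
echo "drop: $(drop_percent "$commits_max_400" "$commits_max_100")%"
echo "deallocations: $(rate "$dealloc_max_100" "$duration") per second"
```

<p>With these inputs the script reports 9582 tx/s against 7145 tx/s, a drop of 25.4%, and roughly 1449 deallocations per second during the second run.</p>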

<p>Also, in the logs, we observe 4 queries that lasted more than 20ms:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2021-10-04 08:43:53.584 CEST [54262] LOG:  duration: 23.487 ms  statement: SELECT 1 FROM t63;
2021-10-04 08:43:56.335 CEST [54270] LOG:  duration: 21.671 ms  statement: SELECT 1 FROM t21;
2021-10-04 08:44:04.226 CEST [54265] LOG:  duration: 20.455 ms  statement: SELECT 1 FROM t109;
2021-10-04 08:44:28.480 CEST [54271] LOG:  duration: 48.653 ms  statement: COMMIT;
</code></pre></div></div>

<p>… compared to only one when <code class="language-plaintext highlighter-rouge">pg_stat_statements.max</code> is set to <code class="language-plaintext highlighter-rouge">400</code> (and
pgbench is launched with rate limiting, so that the transaction rate matches the
previous test).</p>

<p>What is the cause of this performance drop? It is clearly linked to these 144878
deallocations, but is it really the locking problem that I thought of? Let’s use
<code class="language-plaintext highlighter-rouge">perf</code> together with <a href="https://www.brendangregg.com/perf.html#FlameGraphs">Flame Graph</a>
for an On-CPU analysis of one of the 16 backends:</p>

<p><img src="/img/pg_stat_statements_v14_max_100.svg" alt="perf100" />
<img src="/img/pg_stat_statements_v14_max_400.svg" alt="perf400" /></p>

<p>The two flame graphs look similar, except for the leftmost tower of the top one
(<code class="language-plaintext highlighter-rouge">pg_stat_statements.max = 100</code>). Notice the WaitEventSetWait frame that
accounts for 7.4% of all samples.</p>

<blockquote>
  <p><strong>Note:</strong> You can download the svg files (right-click on the image) and run them
in your browser to get the full Flame Graph experience.</p>
</blockquote>

<p>If it’s really a locking problem, an
<a href="https://www.brendangregg.com/offcpuanalysis.html">Off-CPU analysis</a> is more appropriate:</p>

<p><img src="/img/pg_stat_statements_v14_max_100_offcpu.svg" alt="off100" />
<img src="/img/pg_stat_statements_v14_max_400_offcpu.svg" alt="offf400" /></p>

<p>This time, the two graphs are very different!</p>

<p>The tower containing the <code class="language-plaintext highlighter-rouge">futex_wait</code> frame accounts for 28% of the off-CPU
time, and I think it fully explains the performance drop. I’m unsure how to
explain the other big tower (in the middle), but here’s my guess: this is off-CPU
time which would otherwise have been caused by involuntary context switches, in
the absence of locking within pg_stat_statements.</p>

<h1 id="looking-at-wait-events">Looking at wait events</h1>

<p><a href="https://postgrespro.com/">PostgresPro</a> provides a very nice extension,
<a href="https://github.com/postgrespro/pg_wait_sampling/">pg_wait_sampling</a>, which we
can use to understand what is slowing down a given backend.</p>

<p>Let’s launch the pgbench script again, with <code class="language-plaintext highlighter-rouge">pg_stat_statements.max = 100</code>, and
let’s query the <code class="language-plaintext highlighter-rouge">pg_wait_sampling_profile</code> view:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ PID=33903
$ psql -d pgbench -c "SET pg_wait_sampling.history_size = 100000;
&gt; SELECT pg_wait_sampling_reset_profile();
&gt; SELECT pg_sleep(10);
&gt; SELECT event_type, event, count FROM pg_wait_sampling_profile WHERE pid=$PID"

 event_type |       event        | count
------------+--------------------+-------
 LWLock     | pg_stat_statements |    83
 Client     | ClientRead         |   364
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">LWLock</code> stands for <strong>lightweight lock</strong>; most of these locks protect a particular
data structure in shared memory. In this case, we also get the name of the
extension that is waiting for this lock and, unsurprisingly, it’s
<strong>pg_stat_statements</strong>.</p>

<p>With <code class="language-plaintext highlighter-rouge">pg_stat_statements.max = 400</code>, we get:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> event_type |   event    | count
------------+------------+-------
 Client     | ClientRead |    25
</code></pre></div></div>

<p>No lightweight locks, and far fewer <code class="language-plaintext highlighter-rouge">ClientRead</code> events, which I’m unsure how to
explain, but which is consistent with the Off-CPU flame graph (<code class="language-plaintext highlighter-rouge">ClientRead</code>
events probably match the <code class="language-plaintext highlighter-rouge">WaitEventSetWait</code> tower, and <code class="language-plaintext highlighter-rouge">LWLock</code> events match
the <code class="language-plaintext highlighter-rouge">futex_wait</code> tower).</p>

<h1 id="conclusion">Conclusion</h1>

<p>The default value of <code class="language-plaintext highlighter-rouge">pg_stat_statements.max</code> is <code class="language-plaintext highlighter-rouge">5000</code>. For some unknown
reason, the client decreased this value to <code class="language-plaintext highlighter-rouge">1000</code>. They probably assumed that the
performance penalty of having pg_stat_statements loaded would be lower, but on
the contrary it resulted in a much bigger one. The documentation doesn’t say
anything about the danger of setting this value too low.</p>

<p>Only since version 14 can we rely on the view <code class="language-plaintext highlighter-rouge">pg_stat_statements_info</code>, containing the
counter <code class="language-plaintext highlighter-rouge">dealloc</code> and the timestamp <code class="language-plaintext highlighter-rouge">stats_reset</code>, to get an idea of whether
this parameter is set too low. On my 4-CPU machine, the performance drop
is noticeable with more than 700 deallocations per second, but to be on the safe
side, I would recommend increasing this parameter if there are more than 10
deallocations per second, also in order to avoid the extra CPU cost of
these deallocations.</p>

<p>For PostgreSQL versions prior to <code class="language-plaintext highlighter-rouge">14</code>, there’s no simple way of knowing if this
parameter is set high enough, but the presence of slow queries that shouldn’t be
slow in your logs is a good hint. Anyway, I would recommend sticking with the
default value of <code class="language-plaintext highlighter-rouge">pg_stat_statements.max</code>, or increasing it, but never
decreasing it.</p>
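
<p>The rule of thumb above can be turned into a small check. The sketch below is only an illustration: the helper name <code class="language-plaintext highlighter-rouge">max_too_low</code> is hypothetical, the threshold of 10 deallocations per second is the conservative value suggested in this post, and the inputs are the figures from the <code class="language-plaintext highlighter-rouge">pg_stat_statements.max = 100</code> run, assuming it lasted about 100 seconds.</p>

```shell
#!/bin/sh

# Conservative threshold suggested in this post (deallocations per second).
threshold=10

# Hypothetical helper: succeeds (exit status 0) when the deallocation rate,
# computed as $1 deallocations over $2 seconds, exceeds the threshold.
max_too_low() {
        LC_ALL=C awk -v d="$1" -v s="$2" -v t="$threshold" \
                'BEGIN { if (s > 0 && d / s > t) exit 0; exit 1 }'
}

# On a live PostgreSQL 14+ server, the two inputs could come from a query
# such as (not executed here):
#   SELECT dealloc, extract(epoch FROM now() - stats_reset)
#   FROM pg_stat_statements_info;
# Here we reuse the figures from the max = 100 run above.
if max_too_low 144878 100; then
        echo "pg_stat_statements.max looks too low"
else
        echo "pg_stat_statements.max looks fine"
fi
```

<p>Since the rate is an average since the last reset, it is worth resetting pg_stat_statements shortly before measuring, otherwise a long quiet period can hide a recent burst of deallocations.</p>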

<p><em>license: © 2021 – <a href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a></em></p>]]></content><author><name></name></author><category term="PostgreSQL" /><category term="performance" /><category term="pg_stat_statements" /><summary type="html"><![CDATA[I recently came across these surprising logs over the course of an audit at one of Dalibo’s clients: (the logs are trimmed for better readability and anonymisation).]]></summary></entry></feed>