Skip to content

fix(sql): return the last N rows for LIMIT -N over a covering index#7182

Merged
bluestreak01 merged 14 commits into
masterfrom
fix-covering-index-negative-limit
Jun 2, 2026
Merged

fix(sql): return the last N rows for LIMIT -N over a covering index#7182
bluestreak01 merged 14 commits into
masterfrom
fix-covering-index-negative-limit

Conversation

@bluestreak01

@bluestreak01 bluestreak01 commented May 29, 2026

Copy link
Copy Markdown
Member

Fixes #7181

Summary

SELECT ... LIMIT -N over a covering (posting) index with a residual filter returned the first N rows instead of the last N. The result was silently wrong — no error, no warning. This affected single-key queries (WHERE sym = 'A' AND price > 0 LIMIT -3) and the same shape with a multi-key (sym IN (...)) predicate or a bind-variable limit.

The covering factory ignored the requested scan order when serving a page-frame cursor: getPageFrameCursor(executionContext, order) always opened ascending partition frames, while getScanDirection() advertised FORWARD. The parallel negative-limit machinery then collected the lowest-timestamp rows believing they were the highest.

Fix

The two query shapes need different handling because they have different ordering guarantees.

Single-key covering queries are globally timestamp-ordered (partitions ascending, rows ascending within a partition), so a backward scan yields the true last N rows. getPageFrameCursor now honors ORDER_DESC:

  • iterate partitions latest-to-earliest, and
  • within each partition, split the row range into sub-frames at most maxRowsPerFrame wide and emit the highest row-range sub-frame first, each read forward so rows stay ascending within a frame.

This mirrors BwdTableReaderPageFrameCursor and stays correct when a partition holds more than maxRowsPerFrame matching rows — a case where a partition-reverse-only fix would still return wrong results. Parallel filtering is preserved for this shape.

Multi-key covering is only per-key-per-partition ordered, never globally timestamp-ordered, so a correct last-N-by-timestamp cannot be produced from it without a k-way merge. When the limit may be negative (a constant below zero, or a bind variable whose sign is unknown at compile time), codegen routes multi-key queries to the serial FilteredRecordCursorFactory, where LimitRecordCursorFactory computes last-N via size plus skip. A defensive guard in getPageFrameCursor rejects a backward scan on the multi-key path.

getScanDirection() stays FORWARD so ORDER BY ts DESC is not wrongly elided onto the ascending cursor.

Tradeoffs

  • Single-key negative-limit covering queries keep parallel filtering and now read fewer rows for small N (early termination from the high end), instead of scanning ascending and discarding.
  • Multi-key (and bind-variable) negative-limit covering queries forfeit parallel filtering for this narrow shape: they fall back to the serial filter plus a Limit node. The bind-variable case takes the serial path even when the value turns out positive at runtime, because the sign is unknown at compile time. This is a correctness-for-throughput trade on an otherwise silently-wrong path; positive-limit and no-limit multi-key queries are unaffected and still run in parallel.

Test plan

  • New regression tests in CoveringIndexTest:
    • the issue reproducer (10 single-row partitions, LIMIT -3 returns 8,9,10), including a filter that eliminates the low rows and a larger limit spanning all partitions
    • a multi-frame-per-partition variant with setMaxRowsPerFrameForTesting(2), which a partition-reverse-only fix would still fail, across -3/-4/-5 to cross sub-frame boundaries
    • a bind-variable limit on a single-key query, asserting last-N for a negative value and head-N for a positive value on the same compiled statement
    • a multi-key query asserting the plan does not use the parallel Async Filter path and returns the correct tail, while a positive multi-key limit still does
  • Confirmed the single-key tests fail on the unpatched code and pass with the fix.
  • Full CoveringIndexTest (378), PostingIndexCriticalIssuesTest (40), and CoveringCompressorTest (17) pass.

Summary by CodeRabbit

New Features

  • Added support for negative LIMIT values on covering index queries, enabling efficient retrieval of the last N rows

Bug Fixes

  • Fixed row ordering correctness for multi-key covering index queries to maintain timestamp order across partitions
  • Improved LIMIT pushdown behavior on covering indexes to prevent incorrect results when ORDER BY clauses are present

Tests

  • Added comprehensive test suite verifying covering index query correctness across multiple scenarios including filters, projections, and LIMIT variations

A negative LIMIT over a posting/covering index with a residual filter
silently returned the first N rows instead of the last N. The covering
factory ignored the requested scan order when serving a page-frame
cursor: getPageFrameCursor always opened ascending partition frames
while getScanDirection() advertised FORWARD, so the parallel
negative-limit machinery collected the lowest-timestamp rows believing
they were the highest.

Single-key covering queries now honor ORDER_DESC with a genuine
backward scan: DESC partition iteration plus per-partition row-range
sub-frames emitted high-range-first, each read forward so rows stay
ascending within a frame. This mirrors BwdTableReaderPageFrameCursor
and stays correct when a partition holds more than maxRowsPerFrame
matching rows, where a partition-reverse-only fix would not. Parallel
filtering is preserved for this shape.

Multi-key covering is only per-key-per-partition ordered, never
globally timestamp-ordered, so a correct last-N-by-ts cannot come from
it without a k-way merge. When the limit may be negative (a constant
below zero, or a bind variable whose sign is unknown at compile time),
codegen routes multi-key queries to the serial FilteredRecordCursorFactory,
where LimitRecordCursorFactory computes last-N via size plus skip. A
defensive guard in getPageFrameCursor rejects a backward scan on the
multi-key path.

getScanDirection() stays FORWARD so ORDER BY ts DESC is not wrongly
elided onto the ascending cursor.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@bluestreak01 bluestreak01 added Bug Incorrect or unexpected behavior Core Related to storage, data type, etc. SQL Issues or changes relating to SQL execution labels May 29, 2026
@coderabbitai

coderabbitai Bot commented May 29, 2026

Copy link
Copy Markdown

Review Change Stack

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ff50fbbd-0b20-47ba-9bad-4f077016a869

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Walkthrough

This PR fixes a correctness bug where LIMIT -N over a covering index with a residual filter incorrectly returns the first N rows instead of the last N. It implements two levels of defense: codegen-time guards to disable unsafe parallel pushdown, and a new backward-scan path for single-key page-frame cursors. Multi-key iteration is rewritten to use k-way merging for correct timestamp ordering across keys.

Changes

Covering Index Negative LIMIT and Ordering Fixes

Layer / File(s) Summary
LIMIT pushdown decision logic
core/src/main/java/io/questdb/griffin/SqlCodeGenerator.java
Adds coveringScanOrderMatchesQuery() to check ORDER BY alignment and mayBeNegativeLimit() to detect negative LIMIT values. Reworks wrapCoveringWithFilter() to compute pushLimit only when covering scan order matches the query; disables pushdown when negative-limit behavior is possible but the covering factory cannot serve it, falling back to serial path and freeing the async limit function to avoid leaks.
Page-frame cursor ordering support
core/src/main/java/io/questdb/griffin/engine/table/CoveringIndexRecordCursorFactory.java (lines 27–450)
Adds CairoException import and updates getPageFrameCursor() to compute a descending flag from the requested order, throws exception for unsupported multi-key backward scans, and routes to the appropriate page-frame cursor. Introduces supportsNegativeLimitPageFrame() to report single-key support and adds openForwardCoveringCursor() helper.
Page-frame buffer and fill refactoring
core/src/main/java/io/questdb/griffin/engine/table/CoveringIndexRecordCursorFactory.java (lines 746–1411)
Refactors CoveringPageFrameCursor to introduce a descending flag and variable maxRowsPerFrame, makes buffer methods protected, and splits frame construction into allocFrameBuffers() and finalizeFrame() stages. Updates fillFrameForKey() signature to accept explicit rowCap for sub-frame capping. Changes writeCoveredRow() visibility to protected for subclass reuse.
Multi-key k-way merge for timestamp ordering
core/src/main/java/io/questdb/griffin/engine/table/CoveringIndexRecordCursorFactory.java (lines 1850–2161)
Overwrites MultiKeyCoveringCursor to replace per-partition drain with explicit per-key cursor arrays and k-way row-id merging in hasNext(), ensuring rows emit in global timestamp order. Rewrites MultiKeyCoveringPageFrameCursor to resume merges across frames via mergePartitionIndex and select rows by smallest head row id.
Single-key backward-scan implementation
core/src/main/java/io/questdb/griffin/engine/table/CoveringIndexRecordCursorFactory.java (lines 2255–2368)
Implements descending (negative-limit) iteration in SingleKeyCoveringPageFrameCursor by splitting partition row ranges into capped sub-frames and emitting the highest sub-frame first. Adds descending state (descPartitionIndex, descPartitionLo, descSubHi), routes nextImpl() to new nextImplDescending(), and resets descending state in resetIterationState().
Comprehensive multi-key ordering test suite
core/src/test/java/io/questdb/test/cairo/covering/CoveringIndexMultiKeyOrderingTest.java
Introduces new test class with 15 cross-check test methods validating that covering query results match oracle (/*+ no_covering */) across varied shapes: multiple keys, partitions, residual filters, projections, negative limits, bind variables, and explicit ORDER BY. Includes helpers to verify CoveringIndex plan usage and build deterministic interleaved test data.
Test expectation updates and negative LIMIT regressions
core/src/test/java/io/questdb/test/cairo/covering/CoveringIndexTest.java
Updates 7 existing IN-list test assertions to reflect timestamp-ordered output across keys/partitions instead of key-grouped. Adds 4 new regression tests: testNegativeLimitSingleKeyReturnsLastRows, testNegativeLimitSingleKeyMultiFramePerPartition, testNegativeLimitSingleKeyBindVariable, and testNegativeLimitMultiKeyFallsBackToSerial, covering both positive and negative limits, bind variables, and multi-frame partitions.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • puzpuzpuz
  • nwoolmer

Poem

🐰 A rabbit hops through frames both ways,
Backward scans through interleaved days,
Multi-key merging in perfect time,
No more wrong limits—the order's sublime!
Last N rows now correctly flow,
Negative limits steal the show. ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main fix: enabling LIMIT -N over covering indexes to correctly return the last N rows instead of the first N rows.
Linked Issues check ✅ Passed All coding objectives from issue #7181 are met: single-key backward scan support, multi-key fallback routing, proper ORDER BY handling, and comprehensive regression tests.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing LIMIT -N behavior over covering indexes. No unrelated refactoring or scope creep detected.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix-covering-index-negative-limit

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

bluestreak01 and others added 3 commits May 29, 2026 19:01
Multi-key queries over a covering (posting) index returned rows
grouped by key -- each key's posting list drained in full before the
next -- instead of in designated-timestamp order. That disagreed with
the plain index path on the same data for every shape that expects
timestamp order: bare scans, positive and negative LIMIT, and even an
explicit ORDER BY ts, whose sort was elided onto the unordered cursor.

Both covering cursors now k-way merge the per-key cursors by row id,
mirroring HeapRowCursorFactory on the non-covering path:

- the serial record cursor holds one open cursor per key and emits the
  smallest row id each step;
- the parallel page-frame cursor merges keys into timestamp-ordered
  frames (symbol key written per row), chunked at maxRowsPerFrame with
  the merge state resumed across frames.

Separately, ORDER BY ts DESC with a LIMIT returned the wrong end even
for a single key: codegen pushed the limit into the ascending parallel
scan ahead of the sort, truncating the oldest rows so the sort then
reversed those. The limit is now pushed into the scan only when the
covering forward scan order is already the query's final order;
otherwise the outer LimitRecordCursorFactory applies it after the sort.

Multi-key negative limits still use the serial path, as the page-frame
merge is forward-only. Updated the existing covering tests that had
encoded the old key-grouped order, and added a cross-check suite that
compares the covering path against the no_covering path across data
layouts, filters, limits and orderings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
core/src/test/java/io/questdb/test/cairo/covering/CoveringIndexMultiKeyOrderingTest.java (2)

336-339: ⚡ Quick win

Consider using assertQueryNoLeakCheck() for factory property validation.

The current implementation uses printSql + Assert.assertEquals, which only validates data correctness. assertQueryNoLeakCheck() additionally validates factory properties (supportsRandomAccess, expectSize, expectedTimestamp), providing more comprehensive test coverage.

🔍 Example refactor

If there's a suitable assertQueryNoLeakCheck overload that accepts expected results as a string:

 private void assertResult(String expected, String query) throws SqlException {
-    printSql(query);
-    Assert.assertEquals(query, expected, sink.toString());
+    assertQueryNoLeakCheck(expected, query, null, null, true, true);
 }

Note: Adjust the overload and parameters based on the actual AbstractCairoTest API.

Based on learnings: Use assertQueryNoLeakCheck() to assert the results of queries. This method asserts factory properties (supportsRandomAccess, expectSize, expectedTimestamp) in addition to data correctness.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@core/src/test/java/io/questdb/test/cairo/covering/CoveringIndexMultiKeyOrderingTest.java`
around lines 336 - 339, Replace the current assertion that only checks data
correctness in assertResult (which calls printSql and Assert.assertEquals
against sink.toString()) with a call to assertQueryNoLeakCheck so factory
properties are also validated; locate the assertResult method and its callers
and change the assertion to use assertQueryNoLeakCheck with the same expected
string and query (adjust parameters to the available overload on
AbstractCairoTest), removing the manual printSql/Assert.assertEquals usage that
compares sink.toString().

127-129: 💤 Low value

Consider using Java 17 text blocks for SQL statements.

The INSERT statements use string concatenation. Text blocks (multiline string literals) would improve readability and align with modern Java 17 practices recommended in the coding guidelines.

📝 Example refactor for line 127-129
-            execute("INSERT INTO t_disj " +
-                    "SELECT dateadd('d', x::INT, '2024-01-01T00:00:00Z'::TIMESTAMP), " +
-                    "CASE WHEN x <= 5 THEN 'A' ELSE 'B' END, x::DOUBLE FROM long_sequence(10)");
+            execute("""
+                    INSERT INTO t_disj
+                    SELECT dateadd('d', x::INT, '2024-01-01T00:00:00Z'::TIMESTAMP),
+                           CASE WHEN x <= 5 THEN 'A' ELSE 'B' END,
+                           x::DOUBLE
+                    FROM long_sequence(10)
+                    """);

Apply similar transformation to other INSERT statements throughout the file.

As per coding guidelines: Use modern Java 17 features: enhanced switch, multiline string literals, and pattern variables in instanceof checks.

Also applies to: 157-159, 210-211, 225-226, 291-293, 345-347

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@core/src/test/java/io/questdb/test/cairo/covering/CoveringIndexMultiKeyOrderingTest.java`
around lines 127 - 129, Replace the concatenated SQL string passed to
execute(...) in CoveringIndexMultiKeyOrderingTest (the INSERT INTO t_disj
statement) with a Java 17 text block to improve readability; locate the
execute(...) call in the test class (CoveringIndexMultiKeyOrderingTest) and
convert the multi-part string to a single triple-quoted text block containing
the full SQL, then apply the same refactor to the other INSERT statements
mentioned (the execute(...) calls around lines referenced: 157-159, 210-211,
225-226, 291-293, 345-347) so all SQL literals use multiline string literals
instead of concatenation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@core/src/test/java/io/questdb/test/cairo/covering/CoveringIndexMultiKeyOrderingTest.java`:
- Around line 68-93: The two static String[] fields are out of alphabetical
order; move EXPLICIT_ORDER_TAILS so it appears before TS_ORDER_TAILS to satisfy
member ordering rules. Edit the class to swap the declarations of
EXPLICIT_ORDER_TAILS and TS_ORDER_TAILS (referencing the exact symbols
EXPLICIT_ORDER_TAILS and TS_ORDER_TAILS) so static fields remain grouped and
sorted alphabetically without changing their contents.

---

Nitpick comments:
In
`@core/src/test/java/io/questdb/test/cairo/covering/CoveringIndexMultiKeyOrderingTest.java`:
- Around line 336-339: Replace the current assertion that only checks data
correctness in assertResult (which calls printSql and Assert.assertEquals
against sink.toString()) with a call to assertQueryNoLeakCheck so factory
properties are also validated; locate the assertResult method and its callers
and change the assertion to use assertQueryNoLeakCheck with the same expected
string and query (adjust parameters to the available overload on
AbstractCairoTest), removing the manual printSql/Assert.assertEquals usage that
compares sink.toString().
- Around line 127-129: Replace the concatenated SQL string passed to
execute(...) in CoveringIndexMultiKeyOrderingTest (the INSERT INTO t_disj
statement) with a Java 17 text block to improve readability; locate the
execute(...) call in the test class (CoveringIndexMultiKeyOrderingTest) and
convert the multi-part string to a single triple-quoted text block containing
the full SQL, then apply the same refactor to the other INSERT statements
mentioned (the execute(...) calls around lines referenced: 157-159, 210-211,
225-226, 291-293, 345-347) so all SQL literals use multiline string literals
instead of concatenation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 972b7007-d4aa-4a50-83ee-4b68f9ffeb82

📥 Commits

Reviewing files that changed from the base of the PR and between 45683e5 and 3d82522.

📒 Files selected for processing (4)
  • core/src/main/java/io/questdb/griffin/SqlCodeGenerator.java
  • core/src/main/java/io/questdb/griffin/engine/table/CoveringIndexRecordCursorFactory.java
  • core/src/test/java/io/questdb/test/cairo/covering/CoveringIndexMultiKeyOrderingTest.java
  • core/src/test/java/io/questdb/test/cairo/covering/CoveringIndexTest.java

Comment on lines +68 to +93
private static final String[] TS_ORDER_TAILS = {
"",
"LIMIT 3",
"LIMIT 7",
"LIMIT 100",
"LIMIT -3",
"LIMIT -7",
"LIMIT -100",
"LIMIT 2,5",
"LIMIT -5,-2",
"ORDER BY ts",
"ORDER BY ts LIMIT 3",
"ORDER BY ts LIMIT -3",
"ORDER BY ts DESC",
"ORDER BY ts DESC LIMIT 3",
"ORDER BY ts DESC LIMIT -3",
};

// Shapes with an explicit non-timestamp order; both paths route through the
// same sort / key-grouping, so they must agree regardless of the bug.
private static final String[] EXPLICIT_ORDER_TAILS = {
"ORDER BY sym",
"ORDER BY sym, ts",
"ORDER BY price",
"ORDER BY price DESC",
};

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Reorder static fields alphabetically.

The static fields violate the alphabetical ordering requirement. EXPLICIT_ORDER_TAILS should appear before TS_ORDER_TAILS.

📋 Proposed reordering
-    // Shapes that expect global timestamp order; the covering path must match the
-    // plain path on all of them.
-    private static final String[] TS_ORDER_TAILS = {
-            "",
-            "LIMIT 3",
-            "LIMIT 7",
-            "LIMIT 100",
-            "LIMIT -3",
-            "LIMIT -7",
-            "LIMIT -100",
-            "LIMIT 2,5",
-            "LIMIT -5,-2",
-            "ORDER BY ts",
-            "ORDER BY ts LIMIT 3",
-            "ORDER BY ts LIMIT -3",
-            "ORDER BY ts DESC",
-            "ORDER BY ts DESC LIMIT 3",
-            "ORDER BY ts DESC LIMIT -3",
-    };
-
     // Shapes with an explicit non-timestamp order; both paths route through the
     // same sort / key-grouping, so they must agree regardless of the bug.
     private static final String[] EXPLICIT_ORDER_TAILS = {
             "ORDER BY sym",
             "ORDER BY sym, ts",
             "ORDER BY price",
             "ORDER BY price DESC",
     };
+
+    // Shapes that expect global timestamp order; the covering path must match the
+    // plain path on all of them.
+    private static final String[] TS_ORDER_TAILS = {
+            "",
+            "LIMIT 3",
+            "LIMIT 7",
+            "LIMIT 100",
+            "LIMIT -3",
+            "LIMIT -7",
+            "LIMIT -100",
+            "LIMIT 2,5",
+            "LIMIT -5,-2",
+            "ORDER BY ts",
+            "ORDER BY ts LIMIT 3",
+            "ORDER BY ts LIMIT -3",
+            "ORDER BY ts DESC",
+            "ORDER BY ts DESC LIMIT 3",
+            "ORDER BY ts DESC LIMIT -3",
+    };

As per coding guidelines: Java class members must be grouped by kind (static vs. instance) and visibility, and sorted alphabetically.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
private static final String[] TS_ORDER_TAILS = {
"",
"LIMIT 3",
"LIMIT 7",
"LIMIT 100",
"LIMIT -3",
"LIMIT -7",
"LIMIT -100",
"LIMIT 2,5",
"LIMIT -5,-2",
"ORDER BY ts",
"ORDER BY ts LIMIT 3",
"ORDER BY ts LIMIT -3",
"ORDER BY ts DESC",
"ORDER BY ts DESC LIMIT 3",
"ORDER BY ts DESC LIMIT -3",
};
// Shapes with an explicit non-timestamp order; both paths route through the
// same sort / key-grouping, so they must agree regardless of the bug.
private static final String[] EXPLICIT_ORDER_TAILS = {
"ORDER BY sym",
"ORDER BY sym, ts",
"ORDER BY price",
"ORDER BY price DESC",
};
// Shapes with an explicit non-timestamp order; both paths route through the
// same sort / key-grouping, so they must agree regardless of the bug.
private static final String[] EXPLICIT_ORDER_TAILS = {
"ORDER BY sym",
"ORDER BY sym, ts",
"ORDER BY price",
"ORDER BY price DESC",
};
// Shapes that expect global timestamp order; the covering path must match the
// plain path on all of them.
private static final String[] TS_ORDER_TAILS = {
"",
"LIMIT 3",
"LIMIT 7",
"LIMIT 100",
"LIMIT -3",
"LIMIT -7",
"LIMIT -100",
"LIMIT 2,5",
"LIMIT -5,-2",
"ORDER BY ts",
"ORDER BY ts LIMIT 3",
"ORDER BY ts LIMIT -3",
"ORDER BY ts DESC",
"ORDER BY ts DESC LIMIT 3",
"ORDER BY ts DESC LIMIT -3",
};
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@core/src/test/java/io/questdb/test/cairo/covering/CoveringIndexMultiKeyOrderingTest.java`
around lines 68 - 93, The two static String[] fields are out of alphabetical
order; move EXPLICIT_ORDER_TAILS so it appears before TS_ORDER_TAILS to satisfy
member ordering rules. Edit the class to swap the declarations of
EXPLICIT_ORDER_TAILS and TS_ORDER_TAILS (referencing the exact symbols
EXPLICIT_ORDER_TAILS and TS_ORDER_TAILS) so static fields remain grouped and
sorted alphabetically without changing their contents.

bluestreak01 and others added 2 commits June 1, 2026 12:29
MultiKeyCoveringPageFrameCursor.fillMergedFrame allocated a full
INITIAL_CAPACITY buffer set before the merge loop discovered the
partition was drained and returned null. Because nextImpl re-enters
fillMergedFrame once more per partition to learn it is exhausted,
every merged partition paid one buffer allocation it never used --
for VARCHAR/STRING/ARRAY columns that is an aux buffer plus a
capacity*32 var-data buffer per column, plus the symbol buffer.

fillMergedFrame now scans the merge heads first and returns null
before allocFrameBuffers() when every head is drained. The scan
short-circuits on the first present head, so data-bearing frames
pay a single comparison; the drained call skips the allocation.
The early return is equivalent to the prior best < 0 / count == 0
path, and the skipped buffers belonged to a frame that was never
produced, so no dispatched frame is affected.

The single-key fillFrameForKey already guards its allocation behind
the empty-cursor check; this brings the multi-key merge in line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@bluestreak01

Copy link
Copy Markdown
Member Author

Code review

Reviewed the full change surface, not just the diff. The core fix is correct and well-engineered: the descending scan replicates BwdTableReaderPageFrameCursor.computeNativeFrame exactly (the adjustedLo = Math.max(lo, hi - maxRows) split), the negative-limit contract holds (AsyncFilteredNegativeLimitRecordCursor with hasDescendingOrder=false for the FORWARD-direction covering factory), all 19 new tests + 378 existing CoveringIndexTest pass, and the risky shapes cross-check against the no_covering oracle. No correctness bug, no leak in new code, no concurrency hazard. The items below are real but none block.

Verdict: Approve with minor changes.

Critical

None.

Moderate

M1 — PR description overstates the performance benefit. The Tradeoffs bullet says single-key negative-limit queries "now read fewer rows for small N (early termination from the high end)." This is not delivered. The parallel negative-limit path (AsyncFilteredNegativeLimitRecordCursor.fetchAllFramesprepareForDispatch()PageFrameSequence.buildAddressCache()) drains and buffers every frame of every partition before the rowCount >= rowLimit cancel can fire; cancel() only short-circuits the residual-filter reduce stage, not frame production or buffer allocation. This is a pre-existing, universal property of the parallel negative-limit machinery (the cursor serves all filtered bases, not just covering), so it's not a regression — but the claimed benefit doesn't exist. The descending scan delivers correctness (and saves some residual-filter CPU because high-timestamp frames reduce first), not fewer rows read. Suggest rewording the bullet.

M2 — Multi-key k-way merge uses an O(N)/row linear min-scan; the "mirrors HeapRowCursorFactory" comment is inaccurate. MultiKeyCoveringCursor.hasNext and MultiKeyCoveringPageFrameCursor.fillMergedFrame pick the smallest head with an unconditional linear scan over all N keys per emitted row. The comment claims this mirrors the non-covering HeapRowCursor, but that path uses IntLongSortedList — O(N) worst case but adaptive (pollAndReplace shifts only p elements, ≈O(1) when one key dominates), whereas the linear scan is always O(N). A true binary min-heap would be O(log N). For realistic IN-lists (2–20 values) the linear scan is fine and likely wins on cache locality; for large generated lists (hundreds) over dense data it's a real per-row cost. Not a regression (the merge is required for correctness — the old key-grouped order was wrong) and not blocking. Either switch to IntLongSortedList to match the established pattern, or correct the comment to state the actual tradeoff.

M3 — Test coverage gaps on the new paths. Two worth closing:

  • Variable-width covered columns + negative limit / multi-key merge. Every negative-limit test uses price DOUBLE only; the VARCHAR/STRING/BINARY/ARRAY IN-list tests have no limit and no residual filter. finalizeFrame/writeCoveredRow serve all types and the write code is shared with the tested forward path (so risk is moderate), but the descending sub-frame split + growFrameBuffers relocating var-data buffers is an untested interaction. Add a covered-VARCHAR (or ARRAY) LIMIT -N test and a multi-key var-width cross-check with a filter.
  • Sub-frame cap × filter-drop × negative limit. No test combines setMaxRowsPerFrameForTesting(2) + a residual filter that drops rows + a negative limit. That's exactly where an off-by-one in the descending accumulation would hide (e.g. cap=2, single YEAR partition, price > 5 LIMIT -3 so the top sub-frame survives and a lower one is partly dropped).

M4 — mayBeNegativeLimit is misplaced in alphabetical member order. SqlCodeGenerator.java:11337 wedges it between validateSubQueryColumnAndGetGetter and validateTimestampNotInJoinKeys. The surrounding private instance methods are strictly alphabetical, so it should move between lookupColumnIndexesUsingVanillaNames (11088) and prepareLatestByColumnIndexes (11099).

Minor

m1 — maxRowsPerFrame == 0 hangs the descending scan. In nextImplDescending, subLo = Math.max(descPartitionLo, descSubHi - maxRowsPerFrame); with maxRowsPerFrame==0, subLo == descSubHi, so descSubHi never decreases and the loop spins forever. Production-unreachable (PropServerConfiguration.validatePageFrameRows rejects <1), but the test-only setMaxRowsPerFrameForTesting(0) hook would hang CI rather than fail. Cheap guard: long step = Math.max(1, maxRowsPerFrame);.

m2 — Pre-existing error-path leak in wrapCoveringWithFilter. If compileWorkerFiltersConditionally(...) (compiles JIT filters, can throw) or deepClone(...) throws before the AsyncFilteredRecordCursorFactory constructor takes ownership, coveringFactory + filter (+ limit function) leak — the only catch (~10548) frees only dfcFactory. Pre-existing, not a regression. The new mayBeNegativeLimit → init() adds one more throw point but it's essentially unreachable (init runs only for constant-folded limits, a no-op). Since this PR already carries an "error path leak fix" commit, consider wrapping the two callsites (10297, 10397) in a try/catch that frees coveringFactory + filter on throw.

m3 — Stale comment. CoveringIndexRecordCursorFactory.java:~213 says an empty multiKeys list "makes advanceKey() report no rows" — but advanceKey() now throws UnsupportedOperationException; the empty-list short-circuit lives in hasNext()openNextPartitionCursors(). Reword.

m4 — Boolean field naming. protected boolean descending should be isDescending per the is…/has… rule — it sits directly above isExhausted, which uses the prefix.

m5 — Duplicated merge logic. The min-head selection loop and the park-before-probe partition-open loop are near-identical in MultiKeyCoveringCursor and MultiKeyCoveringPageFrameCursor (only openForwardCoveringCursor is shared). A future fix to the merge invariant must touch both. Consider a shared helper.

m6 — fillMergedFrame double-pass (micro). The anyHead pre-scan over N keys duplicates the first iteration of the per-row min-scan. Defer allocFrameBuffers() until after the first best >= 0 selection to fuse the two passes and drop the "allocate-then-discard" rationale.

m7 — Test convention. The explicit-value tests (testSinglePartitionInterleavedExplicit, testSingleKeyOrderByTsDescLimitExplicit, testReferenceOracleIsTimestampOrdered) use a custom assertResult/printSql instead of assertQueryNoLeakCheck, skipping the factory-property assertions (expectSize/expectedTimestamp) that matter for a LIMIT fix. The oracle cross-check method itself is correctly custom (it compares two queries and asserts the plan is genuinely covering).

m8 — Labels. Consider adding Performance (k-way merge + buffer-alloc commit) and regression (9.4.0 shipped getPageFrameCursor hardcoding ORDER_ASC — a user-facing correctness regression vs the plain-index path).

Verified safe (checked and cleared)

  • Concurrency: the reused frame object is safe — buildAddressCache/PageFrameAddressCache.add snapshots all frame addresses into off-heap lists on the single producer thread before any worker runs; covering frames are always NATIVE so no lazy re-read. Producer-only merge state.
  • Multi-key DESC CairoException.critical guard: unreachable defensive guard — getScanDirection()==FORWARD blocks ORDER BY ts DESC elision and recordCursorSupportsRandomAccess()==false blocks top-K; multi-key negative limits route to serial.
  • k-way merge duplicate/tie on shared row ids: cannot happen — WhereClauseParser dedups IN-list strings and the symbol map is a bijection, so distinct keys never share a row id (verified with sym IN ('A','A')).
  • Multi-key serial fallback last-N correctness, deferred-symbol resolution, off-by-one bounds, latestBy interaction, generateLimit double/dropped-limit: all traced and correct.

Summary

Merge after addressing the description accuracy (M1), member ordering (M4), and ideally the var-width / filter-drop negative-limit tests (M3). The performance items (M2) and minors are non-blocking. The multi-key output order now matches the plain-index path (timestamp-ordered), which is why 7 existing assertions were updated — a correct, user-visible consistency improvement.

🤖 Generated with Claude Code

@puzpuzpuz puzpuzpuz self-requested a review June 1, 2026 13:28
@puzpuzpuz

Copy link
Copy Markdown
Contributor

Multi-key covering is only per-key-per-partition ordered, never globally timestamp-ordered, so a correct last-N-by-timestamp cannot be produced from it without a k-way merge.

Heads-up: k-way merge is introduced in #7150 (there is a separate bug which requires k-way merge similar to FilterOnValues cursor's one).

puzpuzpuz and others added 4 commits June 1, 2026 18:03
wrapCoveringWithFilter created the limit function and received a
CoveringIndexRecordCursorFactory (which already owns the partition
frame cursor factory) but had no cleanup on its own throw paths. A
throw from mayBeNegativeLimit's init(), collectColumnIndexes, the
worker-filter JIT compile, or the AsyncFilteredRecordCursorFactory
constructor leaked the covering factory's cursors and key functions,
the filter, and the freshly created limit function. The only
enclosing handler freed the partition frame cursor factory alone.

The method body now runs under a try/catch(Throwable) that frees the
filter, the limit function, and the covering factory before
rethrowing. Closing the covering factory also closes the partition
frame cursor factory, so the two callers now null their own reference
once they hand it to the covering factory, and the outer catch no
longer double-frees it. The method frees the limit function through
x = Misc.free(x) on the non-pushed and serial-fallback paths, keeping
the catch idempotent when it was already released or transferred.

Both factory constructors take no effective ownership on failure --
the half-built object never reaches a caller, so close() never runs
-- so freeing all three inputs in the catch is correct.

Add four negative-limit tests to CoveringIndexMultiKeyOrderingTest:
- var-width covered columns (VARCHAR with a NULL, DOUBLE[]) on the
  single-key backward path, at default cap and 2-row sub-frames
- the same columns on the multi-key merge and serial-fallback paths
- a 10k-row single partition with LIMIT -5000 that forces a
  growFrameBuffers mid-fill relocation
- a 2-row cap with a residual filter that drops rows in the
  surviving top descending sub-frame under a negative limit

CoveringIndexMultiKeyOrderingTest 19/19 and CoveringIndexTest 378/378
pass under assertMemoryLeak.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- nextImplDescending now derives the sub-frame low bound from
  Math.max(1, maxRowsPerFrame), so a test-only cap of 0 cannot stall
  the descending loop (subLo == descSubHi).
- Move mayBeNegativeLimit, with its javadoc, into alphabetical order
  between lookupColumnIndexesUsingVanillaNames and
  prepareLatestByColumnIndexes.
- Restore the fillFrameForKey rowCap/pendingRowCursor contract javadoc
  onto fillFrameForKey and give allocFrameBuffers its own short doc;
  the cap split had left the contract sitting on allocFrameBuffers.
- Correct the MultiKeyCoveringCursor merge comment: it matches the
  result order HeapRowCursorFactory produces but uses a linear O(R*N)
  min-scan, not that path's O(log N) heap poll.
- The unreachable backward-multi-key guard now throws
  CairoException.nonCritical() instead of critical(0), which falsely
  implied storage corruption.
- Reword a stale comment: an empty multiKeys list makes hasNext()'s
  merge find no heads and openNextPartitionCursors() open nothing,
  rather than the never-called advanceKey() reporting no rows.

CoveringIndexMultiKeyOrderingTest 19/19 and CoveringIndexTest 378/378
pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
puzpuzpuz
puzpuzpuz previously approved these changes Jun 1, 2026
@puzpuzpuz

Copy link
Copy Markdown
Contributor

Two moderate items I'd like to highlight. They're not blockers and are more or less essential for the chosen fix, but that's something we could improve in the future patches.

M1 — Negative-limit descending path over-reads/over-materializes [in-diff, perf]

SingleKeyCoveringPageFrameCursor.nextImplDescending (CoveringIndexRecordCursorFactory.java:2366) sizes each backward sub-frame to a fixed maxRowsPerFrame (default 1,000,000) window of row-ids and calls fillFrameForKey(..., Integer.MAX_VALUE), which materializes every matching row in that window into native buffers (writeCoveredRow) before the consumer reads anything. AsyncFilteredNegativeLimitRecordCursor.fetchAllFrames (:215) cancels only after the first frame is fully built. So ... LIMIT -5 over a hot key in a large latest partition copies up to ~1M rows to return 5.

Unlike the standard backward cursor, this path ignores pageFrameMinRows/sharedQueryWorkerCount — BwdTableReaderPageFrameCursor:480 sizes frames via calculatePageFrameRowLimit(...). The per-row copy is inherent to covering (no mmap'd column to point at, unlike the zero-copy BwdTableReaderPageFrameCursor), but the fixed 1M window is an avoidable parameter choice. Note: enabling covering can therefore be slower than the plain index for large LIMIT -N. Confirmed independently by Agents 3, 10, 11. Mitigation: thread the limit N down to bound the first sub-frame, and/or adopt calculatePageFrameRowLimit for consistency.

M2 — Multi-key negative-limit serial fallback scans ~2–3× [in-diff, perf tradeoff]

The PR routes multi-key negative/bind-var limits to FilteredRecordCursorFactory + LimitRecordCursorFactory. Because MultiKeyCoveringCursor.size() returns -1 (:1924), LimitRecordCursorFactory.ensureBoundsResolved (LimitRecordCursorFactory.java:279) runs a full calculateSize counting pass, then toTop + skip + take — reconstructing the k-way merge ~2–3 times. The PR explicitly accepts this throughput tradeoff, and it's correct. But SingleKeyCoveringCursor.size() (:2202) already shows the pattern: a cheap MultiKeyCoveringCursor.size()
summing each key's posting size() (which has an O(genCount) fast count path) would eliminate the redundant counting pass. Worth doing since this path is the deliberate fallback for a common shape. (Agents 3, 11)

The CI "Applying formatting" step runs IntelliJ's CLI formatter and then
fails the build via "git diff --exit-code" if it changes anything. Three
.returns("""...""") text blocks had their content and closing delimiter
committed at the same indent as the opening .returns call, while the
formatter expects the content indented eight spaces deeper (matching every
other text block in the file).

Reindent the three blocks to the formatter's expectation. The change is
whitespace-only: Java strips the common leading indentation of a text
block based on its least-indented line, so the asserted string values are
unchanged and the tests still verify the same output.

This unblocks the "Rust Test and Lint on linux-jdk25" job, where the
formatting gate runs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@bluestreak01

bluestreak01 commented Jun 2, 2026

Copy link
Copy Markdown
Member Author

/azp run macwin

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@bluestreak01

Copy link
Copy Markdown
Member Author

/azp run macwin

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@mtopolnik

Copy link
Copy Markdown
Contributor

[PR Coverage check]

😍 pass : 243 / 256 (94.92%)

file detail

path covered line new line coverage
🔵 io/questdb/griffin/SqlCodeGenerator.java 40 45 88.89%
🔵 io/questdb/griffin/engine/table/CoveringIndexRecordCursorFactory.java 203 211 96.21%

@bluestreak01 bluestreak01 merged commit 8a53c15 into master Jun 2, 2026
53 checks passed
@bluestreak01 bluestreak01 deleted the fix-covering-index-negative-limit branch June 2, 2026 14:17
puzpuzpuz added a commit that referenced this pull request Jun 2, 2026
Resolve conflicts from upstream PRs #7182 (LIMIT -N over a covering
index), #7187 (fluent QueryAssertion builder) and #7157 (ORDER BY
native-memory byte caps).

CoveringIndexRecordCursorFactory: both this branch and #7182
independently reworked the multi-key merge to emit rows in row-id
(ts-ascending) order. Take upstream's version as the base -- its
min-scan merge plus the new single-key LIMIT -N backward scan -- then
re-apply this branch's additive fixes on top: getScanDirection(),
duplicate IN-key dedup, and the TablePageFrameCursor promotion (fixes
the parallel-keyed-group-by-under-SelectedRecord ClassCastException).
getScanDirection() reports OTHER only for multi-key latestBy; single-key
latestBy stays FORWARD (it returns one row, trivially ts-ordered) so the
designated-timestamp metadata upstream's tests expect is preserved.

EncodedSortRecordCursor: keep upstream's extracted throwLimitOverflow().

Tests: take upstream's fluent-builder assertions for shared cases and
keep this branch's new regression tests. #7187 removed the legacy
positional assertQuery()/assertQueryNoLeakCheck() helpers, so migrate
~35 remaining call sites across 10 test files to the fluent builder.

Verified: core test-compile clean; CoveringIndexTest (394),
ExplainPlanTest (534), AsOfJoinTest (115), CompiledFilterTest (50),
LimitTest (47) and the migrated group-by / distinct / IPv4 / array tests
all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
bluestreak01 added a commit that referenced this pull request Jun 3, 2026
Bring in master's #7182 ("return the last N rows for LIMIT -N over a
covering index"). Its production changes to SqlCodeGenerator and
CoveringIndexRecordCursorFactory apply cleanly; this branch touched
neither file.

The only conflict was in CoveringIndexTest.java. Master added four
LIMIT -N regression tests (testNegativeLimit*) plus two private
helpers, getPlan and assertPlanDoesNotContain. This branch had already
deleted both helpers when it migrated covering-index plan checks to the
QueryAssertion builder (.assertsPlanContaining / .assertsPlanNotContaining).

Resolution: keep master's four new tests and restore getPlan, which
those tests call. Drop assertPlanDoesNotContain, which now has no
callers - this matches the branch's deletion and master's intent of
adding the tests. The new tests still use assertSql/getPlan rather
than the builder; migrating them is left as a follow-up since they
exercise the parallel Async Filter path whose factory does not fit the
builder's full random-access battery without per-test tuning.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Bug Incorrect or unexpected behavior Core Related to storage, data type, etc. SQL Issues or changes relating to SQL execution

Projects

None yet

Development

Successfully merging this pull request may close these issues.

LIMIT -N over a posting/covering index with a residual filter returns the first N rows instead of the last N

3 participants