fix(sql): return the last N rows for LIMIT -N over a covering index by bluestreak01 · Pull Request #7182 · questdb/questdb

bluestreak01 · 2026-05-29T15:32:29Z

Summary

SELECT ... LIMIT -N over a covering (posting) index with a residual filter returned the first N rows instead of the last N. The result was silently wrong — no error, no warning. This affected single-key queries (WHERE sym = 'A' AND price > 0 LIMIT -3) and the same shape with a multi-key (sym IN (...)) predicate or a bind-variable limit.

The covering factory ignored the requested scan order when serving a page-frame cursor: getPageFrameCursor(executionContext, order) always opened ascending partition frames, while getScanDirection() advertised FORWARD. The parallel negative-limit machinery then collected the lowest-timestamp rows believing they were the highest.

Fix

The two query shapes need different handling because they have different ordering guarantees.

Single-key covering queries are globally timestamp-ordered (partitions ascending, rows ascending within a partition), so a backward scan yields the true last N rows. getPageFrameCursor now honors ORDER_DESC:

iterate partitions latest-to-earliest, and
within each partition, split the row range into sub-frames at most maxRowsPerFrame wide and emit the highest row-range sub-frame first, each read forward so rows stay ascending within a frame.

This mirrors BwdTableReaderPageFrameCursor and stays correct when a partition holds more than maxRowsPerFrame matching rows — a case where a partition-reverse-only fix would still return wrong results. Parallel filtering is preserved for this shape.

Multi-key covering is only per-key-per-partition ordered, never globally timestamp-ordered, so a correct last-N-by-timestamp cannot be produced from it without a k-way merge. When the limit may be negative (a constant below zero, or a bind variable whose sign is unknown at compile time), codegen routes multi-key queries to the serial FilteredRecordCursorFactory, where LimitRecordCursorFactory computes last-N via size plus skip. A defensive guard in getPageFrameCursor rejects a backward scan on the multi-key path.

getScanDirection() stays FORWARD so ORDER BY ts DESC is not wrongly elided onto the ascending cursor.

Tradeoffs

Single-key negative-limit covering queries keep parallel filtering and now read fewer rows for small N (early termination from the high end), instead of scanning ascending and discarding.
Multi-key (and bind-variable) negative-limit covering queries forfeit parallel filtering for this narrow shape: they fall back to the serial filter plus a Limit node. The bind-variable case takes the serial path even when the value turns out positive at runtime, because the sign is unknown at compile time. This is a correctness-for-throughput trade on an otherwise silently-wrong path; positive-limit and no-limit multi-key queries are unaffected and still run in parallel.

Test plan

New regression tests in CoveringIndexTest:
- the issue reproducer (10 single-row partitions, LIMIT -3 returns 8,9,10), including a filter that eliminates the low rows and a larger limit spanning all partitions
- a multi-frame-per-partition variant with setMaxRowsPerFrameForTesting(2), which a partition-reverse-only fix would still fail, across -3/-4/-5 to cross sub-frame boundaries
- a bind-variable limit on a single-key query, asserting last-N for a negative value and head-N for a positive value on the same compiled statement
- a multi-key query asserting the plan does not use the parallel Async Filter path and returns the correct tail, while a positive multi-key limit still does
Confirmed the single-key tests fail on the unpatched code and pass with the fix.
Full CoveringIndexTest (378), PostingIndexCriticalIssuesTest (40), and CoveringCompressorTest (17) pass.

Summary by CodeRabbit

New Features

Added support for negative LIMIT values on covering index queries, enabling efficient retrieval of the last N rows

Bug Fixes

Fixed row ordering correctness for multi-key covering index queries to maintain timestamp order across partitions
Improved LIMIT pushdown behavior on covering indexes to prevent incorrect results when ORDER BY clauses are present

Tests

Added comprehensive test suite verifying covering index query correctness across multiple scenarios including filters, projections, and LIMIT variations

A negative LIMIT over a posting/covering index with a residual filter silently returned the first N rows instead of the last N. The covering factory ignored the requested scan order when serving a page-frame cursor: getPageFrameCursor always opened ascending partition frames while getScanDirection() advertised FORWARD, so the parallel negative-limit machinery collected the lowest-timestamp rows believing they were the highest. Single-key covering queries now honor ORDER_DESC with a genuine backward scan: DESC partition iteration plus per-partition row-range sub-frames emitted high-range-first, each read forward so rows stay ascending within a frame. This mirrors BwdTableReaderPageFrameCursor and stays correct when a partition holds more than maxRowsPerFrame matching rows, where a partition-reverse-only fix would not. Parallel filtering is preserved for this shape. Multi-key covering is only per-key-per-partition ordered, never globally timestamp-ordered, so a correct last-N-by-ts cannot come from it without a k-way merge. When the limit may be negative (a constant below zero, or a bind variable whose sign is unknown at compile time), codegen routes multi-key queries to the serial FilteredRecordCursorFactory, where LimitRecordCursorFactory computes last-N via size plus skip. A defensive guard in getPageFrameCursor rejects a backward scan on the multi-key path. getScanDirection() stays FORWARD so ORDER BY ts DESC is not wrongly elided onto the ascending cursor. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-29T15:32:39Z

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ff50fbbd-0b20-47ba-9bad-4f077016a869

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

Walkthrough

This PR fixes a correctness bug where LIMIT -N over a covering index with a residual filter incorrectly returns the first N rows instead of the last N. It implements two levels of defense: codegen-time guards to disable unsafe parallel pushdown, and a new backward-scan path for single-key page-frame cursors. Multi-key iteration is rewritten to use k-way merging for correct timestamp ordering across keys.

Changes

Covering Index Negative LIMIT and Ordering Fixes

Layer / File(s)	Summary
LIMIT pushdown decision logic `core/src/main/java/io/questdb/griffin/SqlCodeGenerator.java`	Adds `coveringScanOrderMatchesQuery()` to check ORDER BY alignment and `mayBeNegativeLimit()` to detect negative LIMIT values. Reworks `wrapCoveringWithFilter()` to compute `pushLimit` only when covering scan order matches the query; disables pushdown when negative-limit behavior is possible but the covering factory cannot serve it, falling back to serial path and freeing the async limit function to avoid leaks.
Page-frame cursor ordering support `core/src/main/java/io/questdb/griffin/engine/table/CoveringIndexRecordCursorFactory.java` (lines 27–450)	Adds `CairoException` import and updates `getPageFrameCursor()` to compute a `descending` flag from the requested order, throws exception for unsupported multi-key backward scans, and routes to the appropriate page-frame cursor. Introduces `supportsNegativeLimitPageFrame()` to report single-key support and adds `openForwardCoveringCursor()` helper.
Page-frame buffer and fill refactoring `core/src/main/java/io/questdb/griffin/engine/table/CoveringIndexRecordCursorFactory.java` (lines 746–1411)	Refactors `CoveringPageFrameCursor` to introduce a `descending` flag and variable `maxRowsPerFrame`, makes buffer methods protected, and splits frame construction into `allocFrameBuffers()` and `finalizeFrame()` stages. Updates `fillFrameForKey()` signature to accept explicit `rowCap` for sub-frame capping. Changes `writeCoveredRow()` visibility to protected for subclass reuse.
Multi-key k-way merge for timestamp ordering `core/src/main/java/io/questdb/griffin/engine/table/CoveringIndexRecordCursorFactory.java` (lines 1850–2161)	Overwrites `MultiKeyCoveringCursor` to replace per-partition drain with explicit per-key cursor arrays and k-way row-id merging in `hasNext()`, ensuring rows emit in global timestamp order. Rewrites `MultiKeyCoveringPageFrameCursor` to resume merges across frames via `mergePartitionIndex` and select rows by smallest head row id.
Single-key backward-scan implementation `core/src/main/java/io/questdb/griffin/engine/table/CoveringIndexRecordCursorFactory.java` (lines 2255–2368)	Implements descending (negative-limit) iteration in `SingleKeyCoveringPageFrameCursor` by splitting partition row ranges into capped sub-frames and emitting the highest sub-frame first. Adds descending state (`descPartitionIndex`, `descPartitionLo`, `descSubHi`), routes `nextImpl()` to new `nextImplDescending()`, and resets descending state in `resetIterationState()`.
Comprehensive multi-key ordering test suite `core/src/test/java/io/questdb/test/cairo/covering/CoveringIndexMultiKeyOrderingTest.java`	Introduces new test class with 15 cross-check test methods validating that covering query results match oracle (`/+ no_covering /`) across varied shapes: multiple keys, partitions, residual filters, projections, negative limits, bind variables, and explicit ORDER BY. Includes helpers to verify `CoveringIndex` plan usage and build deterministic interleaved test data.
Test expectation updates and negative LIMIT regressions `core/src/test/java/io/questdb/test/cairo/covering/CoveringIndexTest.java`	Updates 7 existing IN-list test assertions to reflect timestamp-ordered output across keys/partitions instead of key-grouped. Adds 4 new regression tests: `testNegativeLimitSingleKeyReturnsLastRows`, `testNegativeLimitSingleKeyMultiFramePerPartition`, `testNegativeLimitSingleKeyBindVariable`, and `testNegativeLimitMultiKeyFallsBackToSerial`, covering both positive and negative limits, bind variables, and multi-frame partitions.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

puzpuzpuz
nwoolmer

Poem

🐰 A rabbit hops through frames both ways,
Backward scans through interleaved days,
Multi-key merging in perfect time,
No more wrong limits—the order's sublime!
Last N rows now correctly flow,
Negative limits steal the show. ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main fix: enabling LIMIT -N over covering indexes to correctly return the last N rows instead of the first N rows.
Linked Issues check	✅ Passed	All coding objectives from issue `#7181` are met: single-key backward scan support, multi-key fallback routing, proper ORDER BY handling, and comprehensive regression tests.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to fixing LIMIT -N behavior over covering indexes. No unrelated refactoring or scope creep detected.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix-covering-index-negative-limit

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Multi-key queries over a covering (posting) index returned rows grouped by key -- each key's posting list drained in full before the next -- instead of in designated-timestamp order. That disagreed with the plain index path on the same data for every shape that expects timestamp order: bare scans, positive and negative LIMIT, and even an explicit ORDER BY ts, whose sort was elided onto the unordered cursor. Both covering cursors now k-way merge the per-key cursors by row id, mirroring HeapRowCursorFactory on the non-covering path: - the serial record cursor holds one open cursor per key and emits the smallest row id each step; - the parallel page-frame cursor merges keys into timestamp-ordered frames (symbol key written per row), chunked at maxRowsPerFrame with the merge state resumed across frames. Separately, ORDER BY ts DESC with a LIMIT returned the wrong end even for a single key: codegen pushed the limit into the ascending parallel scan ahead of the sort, truncating the oldest rows so the sort then reversed those. The limit is now pushed into the scan only when the covering forward scan order is already the query's final order; otherwise the outer LimitRecordCursorFactory applies it after the sort. Multi-key negative limits still use the serial path, as the page-frame merge is forward-only. Updated the existing covering tests that had encoded the old key-grouped order, and added a cross-check suite that compares the covering path against the no_covering path across data layouts, filters, limits and orderings. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

core/src/test/java/io/questdb/test/cairo/covering/CoveringIndexMultiKeyOrderingTest.java (2)
336-339: ⚡ Quick win

Consider using assertQueryNoLeakCheck() for factory property validation.

The current implementation uses printSql + Assert.assertEquals, which only validates data correctness. assertQueryNoLeakCheck() additionally validates factory properties (supportsRandomAccess, expectSize, expectedTimestamp), providing more comprehensive test coverage.
🔍 Example refactor

If there's a suitable assertQueryNoLeakCheck overload that accepts expected results as a string:
 private void assertResult(String expected, String query) throws SqlException {
-    printSql(query);
-    Assert.assertEquals(query, expected, sink.toString());
+    assertQueryNoLeakCheck(expected, query, null, null, true, true);
 }
Note: Adjust the overload and parameters based on the actual AbstractCairoTest API.
Based on learnings: Use assertQueryNoLeakCheck() to assert the results of queries. This method asserts factory properties (supportsRandomAccess, expectSize, expectedTimestamp) in addition to data correctness.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@core/src/test/java/io/questdb/test/cairo/covering/CoveringIndexMultiKeyOrderingTest.java`
around lines 336 - 339, Replace the current assertion that only checks data
correctness in assertResult (which calls printSql and Assert.assertEquals
against sink.toString()) with a call to assertQueryNoLeakCheck so factory
properties are also validated; locate the assertResult method and its callers
and change the assertion to use assertQueryNoLeakCheck with the same expected
string and query (adjust parameters to the available overload on
AbstractCairoTest), removing the manual printSql/Assert.assertEquals usage that
compares sink.toString().
127-129: 💤 Low value

Consider using Java 17 text blocks for SQL statements.

The INSERT statements use string concatenation. Text blocks (multiline string literals) would improve readability and align with modern Java 17 practices recommended in the coding guidelines.
📝 Example refactor for line 127-129
-            execute("INSERT INTO t_disj " +
-                    "SELECT dateadd('d', x::INT, '2024-01-01T00:00:00Z'::TIMESTAMP), " +
-                    "CASE WHEN x <= 5 THEN 'A' ELSE 'B' END, x::DOUBLE FROM long_sequence(10)");
+            execute("""
+                    INSERT INTO t_disj
+                    SELECT dateadd('d', x::INT, '2024-01-01T00:00:00Z'::TIMESTAMP),
+                           CASE WHEN x <= 5 THEN 'A' ELSE 'B' END,
+                           x::DOUBLE
+                    FROM long_sequence(10)
+                    """);
Apply similar transformation to other INSERT statements throughout the file.
As per coding guidelines: Use modern Java 17 features: enhanced switch, multiline string literals, and pattern variables in instanceof checks.

Also applies to: 157-159, 210-211, 225-226, 291-293, 345-347
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@core/src/test/java/io/questdb/test/cairo/covering/CoveringIndexMultiKeyOrderingTest.java`
around lines 127 - 129, Replace the concatenated SQL string passed to
execute(...) in CoveringIndexMultiKeyOrderingTest (the INSERT INTO t_disj
statement) with a Java 17 text block to improve readability; locate the
execute(...) call in the test class (CoveringIndexMultiKeyOrderingTest) and
convert the multi-part string to a single triple-quoted text block containing
the full SQL, then apply the same refactor to the other INSERT statements
mentioned (the execute(...) calls around lines referenced: 157-159, 210-211,
225-226, 291-293, 345-347) so all SQL literals use multiline string literals
instead of concatenation.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@core/src/test/java/io/questdb/test/cairo/covering/CoveringIndexMultiKeyOrderingTest.java`:
- Around line 68-93: The two static String[] fields are out of alphabetical
order; move EXPLICIT_ORDER_TAILS so it appears before TS_ORDER_TAILS to satisfy
member ordering rules. Edit the class to swap the declarations of
EXPLICIT_ORDER_TAILS and TS_ORDER_TAILS (referencing the exact symbols
EXPLICIT_ORDER_TAILS and TS_ORDER_TAILS) so static fields remain grouped and
sorted alphabetically without changing their contents.

---

Nitpick comments:
In
`@core/src/test/java/io/questdb/test/cairo/covering/CoveringIndexMultiKeyOrderingTest.java`:
- Around line 336-339: Replace the current assertion that only checks data
correctness in assertResult (which calls printSql and Assert.assertEquals
against sink.toString()) with a call to assertQueryNoLeakCheck so factory
properties are also validated; locate the assertResult method and its callers
and change the assertion to use assertQueryNoLeakCheck with the same expected
string and query (adjust parameters to the available overload on
AbstractCairoTest), removing the manual printSql/Assert.assertEquals usage that
compares sink.toString().
- Around line 127-129: Replace the concatenated SQL string passed to
execute(...) in CoveringIndexMultiKeyOrderingTest (the INSERT INTO t_disj
statement) with a Java 17 text block to improve readability; locate the
execute(...) call in the test class (CoveringIndexMultiKeyOrderingTest) and
convert the multi-part string to a single triple-quoted text block containing
the full SQL, then apply the same refactor to the other INSERT statements
mentioned (the execute(...) calls around lines referenced: 157-159, 210-211,
225-226, 291-293, 345-347) so all SQL literals use multiline string literals
instead of concatenation.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 972b7007-d4aa-4a50-83ee-4b68f9ffeb82

📥 Commits

Reviewing files that changed from the base of the PR and between 45683e5 and 3d82522.

📒 Files selected for processing (4)

core/src/main/java/io/questdb/griffin/SqlCodeGenerator.java
core/src/main/java/io/questdb/griffin/engine/table/CoveringIndexRecordCursorFactory.java
core/src/test/java/io/questdb/test/cairo/covering/CoveringIndexMultiKeyOrderingTest.java
core/src/test/java/io/questdb/test/cairo/covering/CoveringIndexTest.java

coderabbitai · 2026-06-01T10:48:21Z

+    private static final String[] TS_ORDER_TAILS = {
+            "",
+            "LIMIT 3",
+            "LIMIT 7",
+            "LIMIT 100",
+            "LIMIT -3",
+            "LIMIT -7",
+            "LIMIT -100",
+            "LIMIT 2,5",
+            "LIMIT -5,-2",
+            "ORDER BY ts",
+            "ORDER BY ts LIMIT 3",
+            "ORDER BY ts LIMIT -3",
+            "ORDER BY ts DESC",
+            "ORDER BY ts DESC LIMIT 3",
+            "ORDER BY ts DESC LIMIT -3",
+    };
+
+    // Shapes with an explicit non-timestamp order; both paths route through the
+    // same sort / key-grouping, so they must agree regardless of the bug.
+    private static final String[] EXPLICIT_ORDER_TAILS = {
+            "ORDER BY sym",
+            "ORDER BY sym, ts",
+            "ORDER BY price",
+            "ORDER BY price DESC",
+    };


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Reorder static fields alphabetically.

The static fields violate the alphabetical ordering requirement. EXPLICIT_ORDER_TAILS should appear before TS_ORDER_TAILS.

📋 Proposed reordering

- // Shapes that expect global timestamp order; the covering path must match the - // plain path on all of them. - private static final String[] TS_ORDER_TAILS = { - "", - "LIMIT 3", - "LIMIT 7", - "LIMIT 100", - "LIMIT -3", - "LIMIT -7", - "LIMIT -100", - "LIMIT 2,5", - "LIMIT -5,-2", - "ORDER BY ts", - "ORDER BY ts LIMIT 3", - "ORDER BY ts LIMIT -3", - "ORDER BY ts DESC", - "ORDER BY ts DESC LIMIT 3", - "ORDER BY ts DESC LIMIT -3", - }; - // Shapes with an explicit non-timestamp order; both paths route through the // same sort / key-grouping, so they must agree regardless of the bug. private static final String[] EXPLICIT_ORDER_TAILS = { "ORDER BY sym", "ORDER BY sym, ts", "ORDER BY price", "ORDER BY price DESC", }; + + // Shapes that expect global timestamp order; the covering path must match the + // plain path on all of them. + private static final String[] TS_ORDER_TAILS = { + "", + "LIMIT 3", + "LIMIT 7", + "LIMIT 100", + "LIMIT -3", + "LIMIT -7", + "LIMIT -100", + "LIMIT 2,5", + "LIMIT -5,-2", + "ORDER BY ts", + "ORDER BY ts LIMIT 3", + "ORDER BY ts LIMIT -3", + "ORDER BY ts DESC", + "ORDER BY ts DESC LIMIT 3", + "ORDER BY ts DESC LIMIT -3", + };

As per coding guidelines: Java class members must be grouped by kind (static vs. instance) and visibility, and sorted alphabetically.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

private static final String[] TS_ORDER_TAILS = {

"",

"LIMIT 3",

"LIMIT 7",

"LIMIT 100",

"LIMIT -3",

"LIMIT -7",

"LIMIT -100",

"LIMIT 2,5",

"LIMIT -5,-2",

"ORDER BY ts",

"ORDER BY ts LIMIT 3",

"ORDER BY ts LIMIT -3",

"ORDER BY ts DESC",

"ORDER BY ts DESC LIMIT 3",

"ORDER BY ts DESC LIMIT -3",

};

// Shapes with an explicit non-timestamp order; both paths route through the

// same sort / key-grouping, so they must agree regardless of the bug.

private static final String[] EXPLICIT_ORDER_TAILS = {

"ORDER BY sym",

"ORDER BY sym, ts",

"ORDER BY price",

"ORDER BY price DESC",

};

// Shapes with an explicit non-timestamp order; both paths route through the

// same sort / key-grouping, so they must agree regardless of the bug.

private static final String[] EXPLICIT_ORDER_TAILS = {

"ORDER BY sym",

"ORDER BY sym, ts",

"ORDER BY price",

"ORDER BY price DESC",

};

// Shapes that expect global timestamp order; the covering path must match the

// plain path on all of them.

private static final String[] TS_ORDER_TAILS = {

"",

"LIMIT 3",

"LIMIT 7",

"LIMIT 100",

"LIMIT -3",

"LIMIT -7",

"LIMIT -100",

"LIMIT 2,5",

"LIMIT -5,-2",

"ORDER BY ts",

"ORDER BY ts LIMIT 3",

"ORDER BY ts LIMIT -3",

"ORDER BY ts DESC",

"ORDER BY ts DESC LIMIT 3",

"ORDER BY ts DESC LIMIT -3",

};

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@core/src/test/java/io/questdb/test/cairo/covering/CoveringIndexMultiKeyOrderingTest.java` around lines 68 - 93, The two static String[] fields are out of alphabetical order; move EXPLICIT_ORDER_TAILS so it appears before TS_ORDER_TAILS to satisfy member ordering rules. Edit the class to swap the declarations of EXPLICIT_ORDER_TAILS and TS_ORDER_TAILS (referencing the exact symbols EXPLICIT_ORDER_TAILS and TS_ORDER_TAILS) so static fields remain grouped and sorted alphabetically without changing their contents.

MultiKeyCoveringPageFrameCursor.fillMergedFrame allocated a full INITIAL_CAPACITY buffer set before the merge loop discovered the partition was drained and returned null. Because nextImpl re-enters fillMergedFrame once more per partition to learn it is exhausted, every merged partition paid one buffer allocation it never used -- for VARCHAR/STRING/ARRAY columns that is an aux buffer plus a capacity*32 var-data buffer per column, plus the symbol buffer. fillMergedFrame now scans the merge heads first and returns null before allocFrameBuffers() when every head is drained. The scan short-circuits on the first present head, so data-bearing frames pay a single comparison; the drained call skips the allocation. The early return is equivalent to the prior best < 0 / count == 0 path, and the skipped buffers belonged to a frame that was never produced, so no dispatched frame is affected. The single-key fillFrameForKey already guards its allocation behind the empty-cursor check; this brings the multi-key merge in line. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

bluestreak01 · 2026-06-01T12:39:59Z

Code review

Reviewed the full change surface, not just the diff. The core fix is correct and well-engineered: the descending scan replicates BwdTableReaderPageFrameCursor.computeNativeFrame exactly (the adjustedLo = Math.max(lo, hi - maxRows) split), the negative-limit contract holds (AsyncFilteredNegativeLimitRecordCursor with hasDescendingOrder=false for the FORWARD-direction covering factory), all 19 new tests + 378 existing CoveringIndexTest pass, and the risky shapes cross-check against the no_covering oracle. No correctness bug, no leak in new code, no concurrency hazard. The items below are real but none block.

Verdict: Approve with minor changes.

Critical

None.

Moderate

M1 — PR description overstates the performance benefit. The Tradeoffs bullet says single-key negative-limit queries "now read fewer rows for small N (early termination from the high end)." This is not delivered. The parallel negative-limit path (AsyncFilteredNegativeLimitRecordCursor.fetchAllFrames → prepareForDispatch() → PageFrameSequence.buildAddressCache()) drains and buffers every frame of every partition before the rowCount >= rowLimit cancel can fire; cancel() only short-circuits the residual-filter reduce stage, not frame production or buffer allocation. This is a pre-existing, universal property of the parallel negative-limit machinery (the cursor serves all filtered bases, not just covering), so it's not a regression — but the claimed benefit doesn't exist. The descending scan delivers correctness (and saves some residual-filter CPU because high-timestamp frames reduce first), not fewer rows read. Suggest rewording the bullet.

M2 — Multi-key k-way merge uses an O(N)/row linear min-scan; the "mirrors HeapRowCursorFactory" comment is inaccurate. MultiKeyCoveringCursor.hasNext and MultiKeyCoveringPageFrameCursor.fillMergedFrame pick the smallest head with an unconditional linear scan over all N keys per emitted row. The comment claims this mirrors the non-covering HeapRowCursor, but that path uses IntLongSortedList — O(N) worst case but adaptive (pollAndReplace shifts only p elements, ≈O(1) when one key dominates), whereas the linear scan is always O(N). A true binary min-heap would be O(log N). For realistic IN-lists (2–20 values) the linear scan is fine and likely wins on cache locality; for large generated lists (hundreds) over dense data it's a real per-row cost. Not a regression (the merge is required for correctness — the old key-grouped order was wrong) and not blocking. Either switch to IntLongSortedList to match the established pattern, or correct the comment to state the actual tradeoff.

M3 — Test coverage gaps on the new paths. Two worth closing:

Variable-width covered columns + negative limit / multi-key merge. Every negative-limit test uses price DOUBLE only; the VARCHAR/STRING/BINARY/ARRAY IN-list tests have no limit and no residual filter. finalizeFrame/writeCoveredRow serve all types and the write code is shared with the tested forward path (so risk is moderate), but the descending sub-frame split + growFrameBuffers relocating var-data buffers is an untested interaction. Add a covered-VARCHAR (or ARRAY) LIMIT -N test and a multi-key var-width cross-check with a filter.
Sub-frame cap × filter-drop × negative limit. No test combines setMaxRowsPerFrameForTesting(2) + a residual filter that drops rows + a negative limit. That's exactly where an off-by-one in the descending accumulation would hide (e.g. cap=2, single YEAR partition, price > 5 LIMIT -3 so the top sub-frame survives and a lower one is partly dropped).

M4 — mayBeNegativeLimit is misplaced in alphabetical member order. SqlCodeGenerator.java:11337 wedges it between validateSubQueryColumnAndGetGetter and validateTimestampNotInJoinKeys. The surrounding private instance methods are strictly alphabetical, so it should move between lookupColumnIndexesUsingVanillaNames (11088) and prepareLatestByColumnIndexes (11099).

Minor

m1 — maxRowsPerFrame == 0 hangs the descending scan. In nextImplDescending, subLo = Math.max(descPartitionLo, descSubHi - maxRowsPerFrame); with maxRowsPerFrame==0, subLo == descSubHi, so descSubHi never decreases and the loop spins forever. Production-unreachable (PropServerConfiguration.validatePageFrameRows rejects <1), but the test-only setMaxRowsPerFrameForTesting(0) hook would hang CI rather than fail. Cheap guard: long step = Math.max(1, maxRowsPerFrame);.

m2 — Pre-existing error-path leak in wrapCoveringWithFilter. If compileWorkerFiltersConditionally(...) (compiles JIT filters, can throw) or deepClone(...) throws before the AsyncFilteredRecordCursorFactory constructor takes ownership, coveringFactory + filter (+ limit function) leak — the only catch (~10548) frees only dfcFactory. Pre-existing, not a regression. The new mayBeNegativeLimit → init() adds one more throw point but it's essentially unreachable (init runs only for constant-folded limits, a no-op). Since this PR already carries an "error path leak fix" commit, consider wrapping the two callsites (10297, 10397) in a try/catch that frees coveringFactory + filter on throw.

m3 — Stale comment. CoveringIndexRecordCursorFactory.java:~213 says an empty multiKeys list "makes advanceKey() report no rows" — but advanceKey() now throws UnsupportedOperationException; the empty-list short-circuit lives in hasNext() → openNextPartitionCursors(). Reword.

m4 — Boolean field naming. protected boolean descending should be isDescending per the is…/has… rule — it sits directly above isExhausted, which uses the prefix.

m5 — Duplicated merge logic. The min-head selection loop and the park-before-probe partition-open loop are near-identical in MultiKeyCoveringCursor and MultiKeyCoveringPageFrameCursor (only openForwardCoveringCursor is shared). A future fix to the merge invariant must touch both. Consider a shared helper.

m6 — fillMergedFrame double-pass (micro). The anyHead pre-scan over N keys duplicates the first iteration of the per-row min-scan. Defer allocFrameBuffers() until after the first best >= 0 selection to fuse the two passes and drop the "allocate-then-discard" rationale.

m7 — Test convention. The explicit-value tests (testSinglePartitionInterleavedExplicit, testSingleKeyOrderByTsDescLimitExplicit, testReferenceOracleIsTimestampOrdered) use a custom assertResult/printSql instead of assertQueryNoLeakCheck, skipping the factory-property assertions (expectSize/expectedTimestamp) that matter for a LIMIT fix. The oracle cross-check method itself is correctly custom (it compares two queries and asserts the plan is genuinely covering).

m8 — Labels. Consider adding Performance (k-way merge + buffer-alloc commit) and regression (9.4.0 shipped getPageFrameCursor hardcoding ORDER_ASC — a user-facing correctness regression vs the plain-index path).

Verified safe (checked and cleared)

Concurrency: the reused frame object is safe — buildAddressCache/PageFrameAddressCache.add snapshots all frame addresses into off-heap lists on the single producer thread before any worker runs; covering frames are always NATIVE so no lazy re-read. Producer-only merge state.
Multi-key DESC CairoException.critical guard: unreachable defensive guard — getScanDirection()==FORWARD blocks ORDER BY ts DESC elision and recordCursorSupportsRandomAccess()==false blocks top-K; multi-key negative limits route to serial.
k-way merge duplicate/tie on shared row ids: cannot happen — WhereClauseParser dedups IN-list strings and the symbol map is a bijection, so distinct keys never share a row id (verified with sym IN ('A','A')).
Multi-key serial fallback last-N correctness, deferred-symbol resolution, off-by-one bounds, latestBy interaction, generateLimit double/dropped-limit: all traced and correct.

Summary

Merge after addressing the description accuracy (M1), member ordering (M4), and ideally the var-width / filter-drop negative-limit tests (M3). The performance items (M2) and minors are non-blocking. The multi-key output order now matches the plain-index path (timestamp-ordered), which is why 7 existing assertions were updated — a correct, user-visible consistency improvement.

_{🤖 Generated with Claude Code}

puzpuzpuz · 2026-06-01T13:45:49Z

Multi-key covering is only per-key-per-partition ordered, never globally timestamp-ordered, so a correct last-N-by-timestamp cannot be produced from it without a k-way merge.

Heads-up: k-way merge is introduced in #7150 (there is a separate bug which requires k-way merge similar to FilterOnValues cursor's one).

wrapCoveringWithFilter created the limit function and received a CoveringIndexRecordCursorFactory (which already owns the partition frame cursor factory) but had no cleanup on its own throw paths. A throw from mayBeNegativeLimit's init(), collectColumnIndexes, the worker-filter JIT compile, or the AsyncFilteredRecordCursorFactory constructor leaked the covering factory's cursors and key functions, the filter, and the freshly created limit function. The only enclosing handler freed the partition frame cursor factory alone. The method body now runs under a try/catch(Throwable) that frees the filter, the limit function, and the covering factory before rethrowing. Closing the covering factory also closes the partition frame cursor factory, so the two callers now null their own reference once they hand it to the covering factory, and the outer catch no longer double-frees it. The method frees the limit function through x = Misc.free(x) on the non-pushed and serial-fallback paths, keeping the catch idempotent when it was already released or transferred. Both factory constructors take no effective ownership on failure -- the half-built object never reaches a caller, so close() never runs -- so freeing all three inputs in the catch is correct. Add four negative-limit tests to CoveringIndexMultiKeyOrderingTest: - var-width covered columns (VARCHAR with a NULL, DOUBLE[]) on the single-key backward path, at default cap and 2-row sub-frames - the same columns on the multi-key merge and serial-fallback paths - a 10k-row single partition with LIMIT -5000 that forces a growFrameBuffers mid-fill relocation - a 2-row cap with a residual filter that drops rows in the surviving top descending sub-frame under a negative limit CoveringIndexMultiKeyOrderingTest 19/19 and CoveringIndexTest 378/378 pass under assertMemoryLeak. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- nextImplDescending now derives the sub-frame low bound from Math.max(1, maxRowsPerFrame), so a test-only cap of 0 cannot stall the descending loop (subLo == descSubHi). - Move mayBeNegativeLimit, with its javadoc, into alphabetical order between lookupColumnIndexesUsingVanillaNames and prepareLatestByColumnIndexes. - Restore the fillFrameForKey rowCap/pendingRowCursor contract javadoc onto fillFrameForKey and give allocFrameBuffers its own short doc; the cap split had left the contract sitting on allocFrameBuffers. - Correct the MultiKeyCoveringCursor merge comment: it matches the result order HeapRowCursorFactory produces but uses a linear O(R*N) min-scan, not that path's O(log N) heap poll. - The unreachable backward-multi-key guard now throws CairoException.nonCritical() instead of critical(0), which falsely implied storage corruption. - Reword a stale comment: an empty multiKeys list makes hasNext()'s merge find no heads and openNextPartitionCursors() open nothing, rather than the never-called advanceKey() reporting no rows. CoveringIndexMultiKeyOrderingTest 19/19 and CoveringIndexTest 378/378 pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…x-negative-limit

puzpuzpuz · 2026-06-01T16:32:24Z

Two moderate items I'd like to highlight. They're not blockers and are more or less essential for the chosen fix, but that's something we could improve in the future patches.

M1 — Negative-limit descending path over-reads/over-materializes [in-diff, perf]

SingleKeyCoveringPageFrameCursor.nextImplDescending (CoveringIndexRecordCursorFactory.java:2366) sizes each backward sub-frame to a fixed maxRowsPerFrame (default 1,000,000) window of row-ids and calls fillFrameForKey(..., Integer.MAX_VALUE), which materializes every matching row in that window into native buffers (writeCoveredRow) before the consumer reads anything. AsyncFilteredNegativeLimitRecordCursor.fetchAllFrames (:215) cancels only after the first frame is fully built. So ... LIMIT -5 over a hot key in a large latest partition copies up to ~1M rows to return 5.

Unlike the standard backward cursor, this path ignores pageFrameMinRows/sharedQueryWorkerCount — BwdTableReaderPageFrameCursor:480 sizes frames via calculatePageFrameRowLimit(...). The per-row copy is inherent to covering (no mmap'd column to point at, unlike the zero-copy BwdTableReaderPageFrameCursor), but the fixed 1M window is an avoidable parameter choice. Note: enabling covering can therefore be slower than the plain index for large LIMIT -N. Confirmed independently by Agents 3, 10, 11. Mitigation: thread the limit N down to bound the first sub-frame, and/or adopt calculatePageFrameRowLimit for consistency.

M2 — Multi-key negative-limit serial fallback scans ~2–3× [in-diff, perf tradeoff]

The PR routes multi-key negative/bind-var limits to FilteredRecordCursorFactory + LimitRecordCursorFactory. Because MultiKeyCoveringCursor.size() returns -1 (:1924), LimitRecordCursorFactory.ensureBoundsResolved (LimitRecordCursorFactory.java:279) runs a full calculateSize counting pass, then toTop + skip + take — reconstructing the k-way merge ~2–3 times. The PR explicitly accepts this throughput tradeoff, and it's correct. But SingleKeyCoveringCursor.size() (:2202) already shows the pattern: a cheap MultiKeyCoveringCursor.size()
summing each key's posting size() (which has an O(genCount) fast count path) would eliminate the redundant counting pass. Worth doing since this path is the deliberate fallback for a common shape. (Agents 3, 11)

…negative-limit

The CI "Applying formatting" step runs IntelliJ's CLI formatter and then fails the build via "git diff --exit-code" if it changes anything. Three .returns("""...""") text blocks had their content and closing delimiter committed at the same indent as the opening .returns call, while the formatter expects the content indented eight spaces deeper (matching every other text block in the file). Reindent the three blocks to the formatter's expectation. The change is whitespace-only: Java strips the common leading indentation of a text block based on its least-indented line, so the asserted string values are unchanged and the tests still verify the same output. This unblocks the "Rust Test and Lint on linux-jdk25" job, where the formatting gate runs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

bluestreak01 · 2026-06-02T09:48:19Z

/azp run macwin

azure-pipelines · 2026-06-02T11:49:11Z

Azure Pipelines successfully started running 1 pipeline(s).

bluestreak01 · 2026-06-02T13:08:19Z

/azp run macwin

azure-pipelines · 2026-06-02T13:08:28Z

Azure Pipelines successfully started running 1 pipeline(s).

mtopolnik · 2026-06-02T13:27:27Z

[PR Coverage check]

😍 pass : 243 / 256 (94.92%)

file detail

	path	covered line	new line	coverage
🔵	io/questdb/griffin/SqlCodeGenerator.java	40	45	88.89%
🔵	io/questdb/griffin/engine/table/CoveringIndexRecordCursorFactory.java	203	211	96.21%

Resolve conflicts from upstream PRs #7182 (LIMIT -N over a covering index), #7187 (fluent QueryAssertion builder) and #7157 (ORDER BY native-memory byte caps). CoveringIndexRecordCursorFactory: both this branch and #7182 independently reworked the multi-key merge to emit rows in row-id (ts-ascending) order. Take upstream's version as the base -- its min-scan merge plus the new single-key LIMIT -N backward scan -- then re-apply this branch's additive fixes on top: getScanDirection(), duplicate IN-key dedup, and the TablePageFrameCursor promotion (fixes the parallel-keyed-group-by-under-SelectedRecord ClassCastException). getScanDirection() reports OTHER only for multi-key latestBy; single-key latestBy stays FORWARD (it returns one row, trivially ts-ordered) so the designated-timestamp metadata upstream's tests expect is preserved. EncodedSortRecordCursor: keep upstream's extracted throwLimitOverflow(). Tests: take upstream's fluent-builder assertions for shared cases and keep this branch's new regression tests. #7187 removed the legacy positional assertQuery()/assertQueryNoLeakCheck() helpers, so migrate ~35 remaining call sites across 10 test files to the fluent builder. Verified: core test-compile clean; CoveringIndexTest (394), ExplainPlanTest (534), AsOfJoinTest (115), CompiledFilterTest (50), LimitTest (47) and the migrated group-by / distinct / IPv4 / array tests all green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Bring in master's #7182 ("return the last N rows for LIMIT -N over a covering index"). Its production changes to SqlCodeGenerator and CoveringIndexRecordCursorFactory apply cleanly; this branch touched neither file. The only conflict was in CoveringIndexTest.java. Master added four LIMIT -N regression tests (testNegativeLimit*) plus two private helpers, getPlan and assertPlanDoesNotContain. This branch had already deleted both helpers when it migrated covering-index plan checks to the QueryAssertion builder (.assertsPlanContaining / .assertsPlanNotContaining). Resolution: keep master's four new tests and restore getPlan, which those tests call. Drop assertPlanDoesNotContain, which now has no callers - this matches the branch's deletion and master's intent of adding the tests. The new tests still use assertSql/getPlan rather than the builder; migrating them is left as a follow-up since they exercise the parallel Async Filter path whose factory does not fit the builder's full random-access battery without per-test tuning. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

bluestreak01 added Bug Incorrect or unexpected behavior Core Related to storage, data type, etc. SQL Issues or changes relating to SQL execution labels May 29, 2026

bluestreak01 and others added 3 commits May 29, 2026 19:01

Merge branch 'master' into fix-covering-index-negative-limit

9e685a3

Merge branch 'master' into fix-covering-index-negative-limit

3d82522

coderabbitai Bot reviewed Jun 1, 2026

View reviewed changes

bluestreak01 and others added 2 commits June 1, 2026 12:29

error path leak fix

c5673b3

puzpuzpuz self-requested a review June 1, 2026 13:28

puzpuzpuz and others added 4 commits June 1, 2026 18:03

Merge remote-tracking branch 'upstream/master' into fix-covering-inde…

39fa05b

…x-negative-limit

Merge branch 'master' into fix-covering-index-negative-limit

c3bc3df

puzpuzpuz previously approved these changes Jun 1, 2026

View reviewed changes

bluestreak01 added 2 commits June 1, 2026 18:06

Merge remote-tracking branch 'origin/master' into fix-covering-index-…

b3e16c0

…negative-limit

merge fallout

55ca668

bluestreak01 dismissed puzpuzpuz’s stale review via 55ca668 June 1, 2026 17:23

Merge branch 'master' into fix-covering-index-negative-limit

fa4284e

bluestreak01 merged commit 8a53c15 into master Jun 2, 2026
53 checks passed

bluestreak01 deleted the fix-covering-index-negative-limit branch June 2, 2026 14:17

puzpuzpuz mentioned this pull request Jun 2, 2026

fix(sql): fix wrong results and memory leaks in posting-index, join and group-by queries #7150

Merged

bluestreak01 mentioned this pull request Jun 2, 2026

fix(core): prevent posting index wrong results after failed out-of-order commit #7192

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(sql): return the last N rows for LIMIT -N over a covering index#7182

fix(sql): return the last N rows for LIMIT -N over a covering index#7182
bluestreak01 merged 14 commits into
masterfrom
fix-covering-index-negative-limit

bluestreak01 commented May 29, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 29, 2026 •

edited

Loading

Review skipped

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 1, 2026

Uh oh!

bluestreak01 commented Jun 1, 2026

Uh oh!

puzpuzpuz commented Jun 1, 2026

Uh oh!

puzpuzpuz commented Jun 1, 2026

Uh oh!

bluestreak01 commented Jun 2, 2026 •

edited

Loading

Uh oh!

azure-pipelines Bot commented Jun 2, 2026

Uh oh!

bluestreak01 commented Jun 2, 2026

Uh oh!

azure-pipelines Bot commented Jun 2, 2026

Uh oh!

mtopolnik commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

bluestreak01 commented May 29, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Fix

Tradeoffs

Test plan

Summary by CodeRabbit

New Features

Bug Fixes

Tests

Uh oh!

coderabbitai Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

bluestreak01 commented Jun 1, 2026

Code review

Critical

Moderate

Minor

Verified safe (checked and cleared)

Summary

Uh oh!

puzpuzpuz commented Jun 1, 2026

Uh oh!

puzpuzpuz commented Jun 1, 2026

M1 — Negative-limit descending path over-reads/over-materializes [in-diff, perf]

M2 — Multi-key negative-limit serial fallback scans ~2–3× [in-diff, perf tradeoff]

Uh oh!

bluestreak01 commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

azure-pipelines Bot commented Jun 2, 2026

Uh oh!

bluestreak01 commented Jun 2, 2026

Uh oh!

azure-pipelines Bot commented Jun 2, 2026

Uh oh!

mtopolnik commented Jun 2, 2026

[PR Coverage check]

file detail

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

bluestreak01 commented May 29, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 29, 2026 •

edited

Loading

bluestreak01 commented Jun 2, 2026 •

edited

Loading