Skip to content

Add Join Order Benchmark (JOB) to tests/benchmarks#107756

Open
ayakovlev-clickhouse wants to merge 1 commit into
masterfrom
add-job-benchmark
Open

Add Join Order Benchmark (JOB) to tests/benchmarks#107756
ayakovlev-clickhouse wants to merge 1 commit into
masterfrom
add-job-benchmark

Conversation

@ayakovlev-clickhouse

Copy link
Copy Markdown
Contributor

Introduce the Join Order Benchmark (JOB) following the existing structure of the tpc-h and tpc-ds benchmarks: init.sql with the schema, settings.json, and a queries/ folder with the 113 ClickHouse-formatted queries.

The README.md documents loading the data from the public Parquet exports on S3, preparing it from the original PostgreSQL-style CSV files via the included convert_csv.py, and lists the queries that deviate slightly from the canonical JOB queries (to avoid empty results).

Changelog category (leave one):

  • Documentation (changelog entry is not required)

...

Introduce the Join Order Benchmark (JOB) following the existing structure of
the `tpc-h` and `tpc-ds` benchmarks: `init.sql` with the schema, `settings.json`,
and a `queries/` folder with the 113 ClickHouse-formatted queries (zero-padded
`query_NNx.sql` naming).

The `README.md` documents loading the data from the public Parquet exports on S3,
preparing it from the original PostgreSQL-style CSV files via the included
`convert_csv.py`, and lists the queries that deviate slightly from the canonical
JOB queries (to avoid empty results).

Co-authored-by: Cursor <cursoragent@cursor.com>
@clickhouse-gh

clickhouse-gh Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [3a4b295]

Summary:


AI Review

Summary

This PR adds a standalone Join Order Benchmark under tests/benchmarks/job with schema, settings, data-loading docs, a CSV converter, and 113 query files. The benchmark structure is mostly self-contained, but the CSV converter has a real failure-path issue: truncated quoted input can be accepted with exit code 0, producing an incomplete dataset without any error.

Missing context / blind spots
  • ⚠️ No local clickhouse binary was present in this checkout, so I could not run init.sql or parse/run all 113 queries. A smoke run against the documented Parquet data would close this gap.
Findings

⚠️ Majors

  • [tests/benchmarks/job/convert_csv.py:101] convert_csv.py silently drops an incomplete final quoted record. When pending remains at EOF and parse_record returns (None, False), the script exits 0; with earlier complete rows it emits only those rows. This can build a truncated JOB dataset from a corrupted or partial CSV download without surfacing an error.
    Suggested fix: return a non-zero exit code, with an error on stderr, whenever pending is still incomplete at EOF.
Tests
  • ⚠️ Add or run the smallest regression check for convert_csv.py EOF handling, for example an unterminated quoted input that must fail non-zero instead of producing partial output.
Final Verdict

Status: ⚠️ Request changes

Minimum required action: make convert_csv.py fail on incomplete final records instead of silently dropping them.

@clickhouse-gh clickhouse-gh Bot added the pr-documentation Documentation PRs for the specific code PR label Jun 17, 2026
@rienath rienath self-assigned this Jun 17, 2026
if complete:
writer.writerow(fields)
pending = None
if pending is not None:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convert_csv.py should fail when parse_record reports an incomplete final record. With the current EOF path, an input such as 1,"unterminated\n reaches this block, parse_record returns (None, False), and the script exits 0 after writing no error/output; if earlier records were complete, it emits only those rows. That can silently build a truncated JOB dataset from a corrupted download. Please return a non-zero exit code whenever pending remains incomplete at EOF instead of dropping it.

@alexey-milovidov alexey-milovidov left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, though they are not wired into CI now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-documentation Documentation PRs for the specific code PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants