bytes_required returns one byte too many for values like -128, -32768 #3522

@vishnuprakaz

Description

Apache Iceberg version

None

Please describe the bug 🐞

Describe the bug

bytes_required (pyiceberg/utils/decimal.py) is documented to
return the minimum number of bytes for a value, but it
returns one byte too many for negatives equal to -2^(8k-1) (e.g.
-128, -32768).

You can see the contradiction directly: it claims "minimum" but
returns 2 for a value that clearly fits in 1 byte:

from pyiceberg.utils.decimal import bytes_required

print(bytes_required(-128))                    # 2
print((-128).to_bytes(1, "big", signed=True))  # b'\x80'  <- -128 fits in 1 byte
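
For comparison, the minimal two's-complement width can be derived from the bit length of the value. The sketch below is not PyIceberg's implementation and is not a proposed patch; `min_signed_bytes` is a hypothetical helper name used only to illustrate the expected result for the boundary values.

```python
def min_signed_bytes(value: int) -> int:
    """Hypothetical helper: smallest byte count that holds `value`
    as a big-endian two's-complement integer."""
    # Magnitude bits: for negatives, ~value == -value - 1, so -128 -> 127 (7 bits).
    magnitude_bits = (value if value >= 0 else ~value).bit_length()
    # One extra bit for the sign, then round up to whole bytes.
    return (magnitude_bits + 1 + 7) // 8


for v in (-129, -128, 127, 128, -32769, -32768):
    n = min_signed_bytes(v)
    # to_bytes raises OverflowError if n were too small, so this also checks the width.
    print(v, n, v.to_bytes(n, "big", signed=True))
```

With this rule -128 and 127 both need 1 byte, while -129 and 128 need 2.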

This matters because the decimal bucket transform hashes these
bytes, so the extra byte makes PyIceberg pick a different bucket
than Spark/Java for the same value. As a result, the same row can
land in a different partition depending on the engine.
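
To make the mismatch concrete, here is a small sketch using only the standard library. It does not call PyIceberg or the hash function itself; it only shows that the 1-byte and 2-byte encodings of the same unscaled value are different byte strings, which is why a hash over them can disagree. The variable names are illustrative.

```python
from decimal import Decimal

# Unscaled integer for Decimal("-1.28") at scale 2.
unscaled = int(Decimal("-1.28").scaleb(2))

minimal = unscaled.to_bytes(1, "big", signed=True)  # minimal encoding
padded = unscaled.to_bytes(2, "big", signed=True)   # one sign-extension byte too many

print(unscaled)  # -128
print(minimal)   # b'\x80'
print(padded)    # b'\xff\x80'

# Both decode to the same number, but the byte strings fed to the hash differ.
print(int.from_bytes(minimal, "big", signed=True) == int.from_bytes(padded, "big", signed=True))
print(minimal == padded)
```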

Reproducer

from decimal import Decimal
from pyiceberg.types import DecimalType
from pyiceberg.transforms import BucketTransform

dt = DecimalType(precision=5, scale=2)
print(BucketTransform(num_buckets=16).transform(dt)(Decimal("-1.28")))  # 12; should be 13

Expected behavior

Minimal byte encoding (-128 → 1 byte), matching the Iceberg spec
and Spark/Java, so bucketing agrees across engines (bucket 13
here).

Low frequency (only values equal to -2^(8k-1)), but a genuine
cross-engine correctness bug.
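
The boundary values in question can be enumerated with a short standard-library sketch. `boundary_values` and `fits` are hypothetical helper names; the code only demonstrates that each -2^(8k-1) fits in exactly k bytes while the next value down does not.

```python
def fits(value: int, num_bytes: int) -> bool:
    """Hypothetical helper: True if `value` fits in `num_bytes` signed big-endian bytes."""
    try:
        value.to_bytes(num_bytes, "big", signed=True)
        return True
    except OverflowError:
        return False


def boundary_values(max_bytes: int) -> list[int]:
    """Most negative value representable in k bytes, for k = 1..max_bytes."""
    return [-(2 ** (8 * k - 1)) for k in range(1, max_bytes + 1)]


for k, b in enumerate(boundary_values(4), start=1):
    # The boundary fits in k bytes; one less than the boundary needs k + 1.
    print(k, b, fits(b, k), fits(b - 1, k))
```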

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time