zoneinfo: pure-Python POSIX TZ Jn/n day-of-year field accepts non-digit input via int() (C rejects) #152847

Description

@tonghuaroot

Bug report

Bug description

In Lib/zoneinfo/_zoneinfo.py, _parse_dst_start_end() validates the Mm.w.d
transition rule strictly with an re.ASCII fullmatch, but the Jn (Julian)
and n (0-based) day-of-year branches fall through to a bare int(date) with
no format guard:

    else:
        if type == "J":
            n_is_julian = True
            date = date[1:]
        else:
            n_is_julian = False

        doy = int(date)          # <-- no ASCII / format check
        offset = _DayOffset(doy, n_is_julian)

int() accepts things the C accelerator's day-of-year parser rejects. The C
side (Modules/_zoneinfo.c, parse_transition_rule) reads the field with
parse_digits(&ptr, 1, 3, &day), which consumes 1 to 3 ASCII digits via
Py_ISDIGIT and nothing else. So the two implementations disagree on the same
POSIX TZ string.
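
To make the C-side behaviour concrete, here is a rough Python model of a "1 to 3 ASCII digits" reader. It is only a sketch based on the description above: parse_digits_model and its return convention are hypothetical and are not copied from Modules/_zoneinfo.c.

```python
def parse_digits_model(text, pos, min_digits, max_digits):
    """Hypothetical model: consume min..max ASCII digits starting at pos.

    Returns (value, new_pos). Raises ValueError if fewer than min_digits
    ASCII digits are available. Anything after new_pos is left for the
    caller to accept or reject.
    """
    value = 0
    count = 0
    # Only the ten ASCII digits qualify; "_", "+", " " and non-ASCII
    # digits such as U+0661 all stop the scan.
    while (count < max_digits and pos < len(text)
           and text[pos] in "0123456789"):
        value = value * 10 + (ord(text[pos]) - ord("0"))
        pos += 1
        count += 1
    if count < min_digits:
        raise ValueError(f"expected at least {min_digits} ASCII digit(s)")
    return value, pos
```

Under this model "001" is consumed whole, while "0001" stops after three digits and "1_0" stops after one, leaving trailing characters unconsumed. The report states that C rejects those inputs; how the caller handles the leftover text is not modelled here.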

The most serious case is a silent miscompile, not a crash: int('1_0')
is 10 (PEP 515 underscore grouping), so a TZ string like
AAA4BBB,J1_0,J300/2 builds a valid but different zone (DST starts on day
10) in pure Python, while the C accelerator raises ValueError. A program
that relies on the pure fallback silently computes wrong local times instead
of reporting the malformed rule.

Other pure-accept / C-reject inputs for the day-of-year field: a leading +
(J+1), a leading space (J 1), 4-or-more-digit widths (J0001), and
non-ASCII digits (Arabic-Indic ١).
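
The leniency comes from the built-in int() itself and can be checked without zoneinfo. A small sketch follows; the dictionary name observed is just for illustration, and each string is the day-of-year text after any leading J has been stripped.

```python
# Day-of-year strings that int() accepts even though they are not
# "1 to 3 ASCII digits".
samples = [
    "1_0",      # PEP 515 underscore grouping
    "+1",       # leading sign
    " 1",       # leading whitespace
    "0001",     # four digits
    "\u0661",   # ARABIC-INDIC DIGIT ONE
]

observed = {text: int(text) for text in samples}

for text, value in observed.items():
    print(f"int({text!r}) == {value}")
```

None of these raise, so the pure-Python branch carries on with the resulting integer.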

Differential (main, before fix)

TZ template AAA4BBB,<token>,J300/2, only <token> varies; loaded through
both implementations via a crafted TZif v2+ footer:

token                  C accelerator   pure-Python (before)
J1_0                   reject          accept: day 10 (silent miscompile)
1_0                    reject          accept: day 10 (silent miscompile)
J+1                    reject          accept
+1                     reject          accept
"J 1"                  reject          accept
" 1" (leading space)   reject          accept
J0001                  reject          accept
0001                   reject          accept
J١ (Arabic-Indic 1)    reject          accept
١                      reject          accept
J01, J001              accept          accept (agree; 1-3 digit leading zeros are valid)
J1, J365, 0, 365       accept          accept (valid controls)
J366, J400, J1234      reject          reject (agree; range/width)

10 divergent inputs. The C accelerator consumes at most 3 digits, so
J0001 (4 digits) is rejected by C — any fix must not accept it either.

CPython versions

main (3.16). The pure-Python parser has carried this since the POSIX TZ
support was added.

Fix

Add an re.ASCII digit guard matching C's parse_digits(&ptr, 1, 3, &day)
(1 to 3 ASCII digits) before int(), in the J/n branch only:

        if re.fullmatch(r"\d{1,3}", date, re.ASCII) is None:
            raise ValueError(f"Invalid dst start/end date: {dststr}")
        doy = int(date)

This makes the pure-Python parser match the C parser on this field: it
rejects the 10 divergent inputs, still accepts the leading-zero J01/J001
forms that C accepts, and leaves the existing _DayOffset range check
([julian, 365]) to reject out-of-range values, so no numeric-range behaviour
changes. All 499 bundled IANA zones parse byte-identically through both
implementations after the fix.
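
The effect of the guard can be sanity-checked in isolation. The sketch below is a standalone stand-in for the J/n branch, not the patched _parse_dst_start_end(): parse_day_of_year and accepts are hypothetical helpers, and the explicit range test only approximates the existing _DayOffset check.

```python
import re


def parse_day_of_year(date):
    """Hypothetical standalone version of the J/n branch with the guard."""
    julian = date[:1] == "J"
    if julian:
        date = date[1:]

    # Proposed guard: exactly 1 to 3 ASCII digits, nothing else.
    if re.fullmatch(r"\d{1,3}", date, re.ASCII) is None:
        raise ValueError(f"Invalid dst start/end date: {date!r}")
    doy = int(date)

    # Stand-in for the _DayOffset range check: [1, 365] for Jn, [0, 365] for n.
    if not (int(julian) <= doy <= 365):
        raise ValueError(f"day of year out of range: {doy}")
    return doy, julian


def accepts(token):
    try:
        parse_day_of_year(token)
    except ValueError:
        return False
    return True


divergent = ["J1_0", "1_0", "J+1", "+1", "J 1", " 1",
             "J0001", "0001", "J\u0661", "\u0661"]
still_valid = ["J01", "J001", "J1", "J365", "0", "365"]
still_invalid = ["J366", "J400", "J1234"]

print([t for t in divergent if accepts(t)])        # tokens wrongly accepted
print([t for t in still_valid if not accepts(t)])  # tokens wrongly rejected
```

With the guard in place both printed lists should be empty, and the J366/J400/J1234 controls remain rejected.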


Labels: stdlib (Standard Library Python modules in the Lib/ directory), type-bug (An unexpected behavior, bug, or error)
