Add LPDiD (Local Projections DiD) estimator (Dube et al. 2025) #575
Local Projections DiD (Dube, Girardi, Jordà & Taylor 2025), absorbing-treatment main path. Per-horizon long-difference LP regression on a clean-control sample; variance-weighted (default) and equally-weighted (reweight) estimands; regression-adjustment / direct covariate paths; premean-differenced base periods; pooled pre/post; no-composition; cluster-robust SE at unit (t(G-1)).
- diff_diff/lpdid.py, lpdid_results.py: estimator + finished results class (summary/to_dict/to_dataframe/repr, cluster metadata). RA-path influence-function variance routed through linalg._rank_guarded_inv; covariate-homogeneity UserWarning on the direct-inclusion path.
- Registered in diff_diff/__init__; doc-deps, api/lpdid.rst, llms.txt (18->19), llms-full.txt block + coverage test, README, choosing_estimator, REGISTRY deviations.
- tests/test_lpdid.py (35 tests): analytical DGP recovery + cross-estimator equivalences (reweighted == CallawaySantAnna; pmd single-cohort == BJS; 2x2 h=0 == first-difference DiD). make_lpdid_panel DGP helper. R-parity against the authors' lpdid packages is the B2 follow-up.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
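The 2x2 equivalence listed in the tests above can be illustrated with a minimal sketch. It assumes a toy balanced panel with a single treatment date, where calendar-time fixed effects reduce the per-horizon regression to a treated-minus-clean-control difference in mean long differences. The data, `first_treated`, and both helper functions are hypothetical illustrations, not the `diff_diff` API.

```python
from statistics import mean

# Toy balanced panel: outcome[unit][t]. Units "a" and "b" are treated from
# t=2 onward; "c" and "d" are never treated. All values are made up.
outcome = {
    "a": [1.0, 2.0, 5.0, 6.0],
    "b": [0.0, 1.0, 4.0, 6.0],
    "c": [2.0, 3.0, 4.0, 5.0],
    "d": [1.0, 2.0, 3.0, 4.0],
}
first_treated = {"a": 2, "b": 2, "c": None, "d": None}


def long_difference(unit, t, h):
    """y_{i,t+h} - y_{i,t-1}, the dependent variable at horizon h."""
    return outcome[unit][t + h] - outcome[unit][t - 1]


def lpdid_single_cohort(event_time, h):
    """Difference in mean long differences at one event time.

    Simplification: with a single treatment date, only that date has
    within-period variation in the treatment-switch indicator.
    """
    treated = [u for u, g in first_treated.items() if g == event_time]
    # Clean controls: units still untreated at t+h.
    clean = [u for u, g in first_treated.items()
             if g is None or g > event_time + h]
    return (mean(long_difference(u, event_time, h) for u in treated)
            - mean(long_difference(u, event_time, h) for u in clean))


beta_h0 = lpdid_single_cohort(event_time=2, h=0)
beta_h1 = lpdid_single_cohort(event_time=2, h=1)
```

At `h=0` the long difference is the first difference, so `beta_h0` coincides with the first-difference DiD on this toy panel.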
…on; summary alpha
Addresses unbalanced-panel correctness:
- Reweight equal-weighting denominators now come from the realized post-drop sample (was the pre-drop panel), preserving the Callaway-Sant'Anna equivalence on unbalanced panels.
- Regression-adjustment path drops treated observations at event times with no clean control (counterfactual unidentified) with a UserWarning, and guards the SE to stay NaN-consistent with a NaN point estimate (the prior groupby-sum vectorization silently turned all-NaN scores into 0.0).
- Validate pre_window / post_window as non-negative integers.
- LPDiDResults.summary() no longer accepts an alpha override (it would relabel the CI level without recomputing the displayed intervals).
- Docs: clarify no_composition fixes the post-treatment composition; reword the api/lpdid.rst covariate-path wording (RA is preferred, not auto-default).
- Tests: unbalanced reweight==CS, RA NaN-consistency, fixed post composition, window validation.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
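The NaN-consistency point above comes from a general property of NaN-skipping sums: a group with no usable value sums to 0.0 rather than NaN. The following pure-Python sketch mimics that behavior and shows one possible guard. `skipna_sum`, `guarded_sum`, and the score values are hypothetical and only stand in for the groupby-sum step the commit describes.

```python
import math


def skipna_sum(values):
    """Sum that skips NaN, so an all-NaN group silently sums to 0.0."""
    return sum((v for v in values if not math.isnan(v)), 0.0)


def guarded_sum(values):
    """Same sum, but NaN when the group has no usable value at all."""
    usable = [v for v in values if not math.isnan(v)]
    return sum(usable, 0.0) if usable else math.nan


# Hypothetical per-cluster scores: cluster "b" is unidentified (all NaN).
scores = {"a": [0.5, -0.25], "b": [math.nan, math.nan]}

naive = {g: skipna_sum(v) for g, v in scores.items()}
guarded = {g: guarded_sum(v) for g, v in scores.items()}
```

With the guard, the unidentified cluster stays NaN, so a variance built from these sums cannot look finite next to a NaN point estimate.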
…ation
- PMD paths require the ACTIVE baseline column (the premean column) rather than always the exact t-1 outcome, so an unbalanced-panel observation with a valid PMD baseline but a missing t-1 outcome is no longer silently dropped.
- Reject empty pooled pre windows (pre_window < 2 with pooled output requested) with a clear ValueError; exclude the h=-1 reference horizon from the supported pooled-pre window (it would inject mechanical zero long differences).
- Tests for PMD missing-t-1 retention, empty pooled pre window, and -1 rejection.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ficient_action
- no_composition now requires every post target (h=0..H) observed, not just the maximum horizon, so post-horizon samples stay fixed under non-monotone unbalanced missingness; the fixed-composition mask is applied only to post event horizons and the pooled-post window (pre-placebos use available data).
- Propagate the public rank_deficient_action parameter into LPDiDResults and to_dict().
- Docs: qualify the variance-weighted "stacked" equivalence as Cengiz et al. (2019)-style, not diff-diff's Wing-et-al StackedDiD.
- Tests: non-monotone no_composition fixed post sample, rank_deficient_action propagation.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The regression-adjustment covariate path uses an influence-function cluster
variance (ImputationDiD/BJS family), not an OLS CR1 sandwich. Record
vcov_type="if_cluster" for that path (reweight + covariates/lags/absorb) and
render an accurate summary label ("Influence-function cluster-robust"), so the
results metadata and summary no longer mislabel it as hc1 / CR1.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rols warn
- pmd="max" premean baseline now divides by the count of NON-MISSING prior outcomes (not prior rows), so a present-but-NaN pretreatment outcome no longer deflates the baseline and silently biases the estimate (cumsum already skips NaN, so only the denominator was wrong).
- The direct-inclusion homogeneity warning (reweight=False) now also fires for ylags / dylags, which are direct-included controls subject to the same non-negative-weight caveat (online Appendix B.2.2); warning reworded to "covariate-style controls".
- Tests: pmd="max" NaN-history exclusion, ylags/dylags warning.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… lag metadata
- no_composition now requires the active baseline and every post-treatment target OUTCOME to be non-missing (value-based via reindex), so the fixed post composition holds under any missingness encoding -- absent rows OR present-but-NaN outcomes -- not merely row existence.
- The pooled-window identification pre-check uses base identification (apply_no_composition=False); pooled estimation applies its own pooled-window composition mask, so a narrower post_pooled with missing far-horizon data is no longer spuriously rejected as unidentified.
- Record ylags / dylags on LPDiDResults and to_dict() for auditability.
- Reject bool pre_pooled / post_pooled (bool is an int in Python).
- Tests: present-but-NaN no_composition, bool rejection, lag metadata.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
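The bool rejection above is needed because `bool` subclasses `int` in Python, so a plain `isinstance(value, int)` check accepts `True`. A minimal sketch of such a guard follows. `validate_window` is a hypothetical helper, and the accepted type (a non-negative integer) is an assumption for illustration rather than the library's actual parameter contract.

```python
def validate_window(name, value):
    """Accept a non-negative int; reject bool, which subclasses int."""
    # The bool test must come first: isinstance(True, int) is True.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(
            f"{name} must be an integer, got {type(value).__name__}"
        )
    if value < 0:
        raise ValueError(f"{name} must be non-negative, got {value}")
    return value


checked = validate_window("post_window", 3)
```

Without the explicit bool test, `True` would be silently treated as a window of length 1.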
…e dead code
- Validate alpha in (0, 1) and require a numeric `time` column (clear error instead of a cryptic downstream failure on irregular/datetime labels).
- Results-level n_clusters now reports the realized cluster count of the pooled-post headline row (the summary "G"), not the full-panel count; per-row realized counts remain in the event_study / pooled tables.
- Remove the now-unused _outcome_available_mask (superseded by the value-based availability check in _common_clean_sample_indicator).
- Document the numeric-time requirement in api/lpdid.rst.
- Tests for alpha / time validation and headline n_clusters.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…aN clusters
- Validate the global time grid is integer-spaced by 1 (not merely numeric): irregular grids (e.g. 2000, 2002, ...) raise rather than silently producing empty t+h horizons / inconsistent horizon meanings.
- Reject str covariates / absorb (would iterate character-by-character).
- Reject missing values in the effective cluster column (they would silently drop from the RA cluster variance via groupby, biasing SEs with no warning).
- Tests for all three.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
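The first two guards above can be sketched in a few lines. Both functions are hypothetical stand-ins with assumed names and error messages. The sketch also includes the integer-valued label check that a later commit in this PR adds, because a spacing check alone admits grids such as 0.5, 1.5.

```python
def validate_time_grid(times):
    """Require integer-valued labels spaced exactly 1 apart."""
    grid = sorted(set(times))
    if any(float(t) != int(t) for t in grid):
        raise ValueError("time labels must be integer-valued")
    steps = {int(b) - int(a) for a, b in zip(grid, grid[1:])}
    if steps - {1}:
        raise ValueError(
            f"time grid must be consecutive; found steps {sorted(steps)}"
        )
    return [int(t) for t in grid]


def validate_name_list(name, value):
    """Reject a bare str, which would iterate character by character."""
    if isinstance(value, str):
        raise TypeError(f"{name} must be a list of column names, not a str")
    return list(value)


grid = validate_time_grid([2001, 2000, 2002, 2000])
absorb = validate_name_list("absorb", ["region"])
```

Failing early here keeps an irregular grid from silently changing what "horizon h" means further downstream.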
_prepare_panel built outcome lags, first differences, and integer-pmd premean baselines (plus treatment-entry detection) with row-order ops (shift/diff/rolling), which equate "previous observed row" with calendar t-1 -- correct only on a regular per-unit grid. On a unit with an interior time gap (observed t=0,2 missing t=1, possible in an unbalanced panel even when the global grid is consecutive) they silently used the wrong period.
Fix: reindex each unit to its complete interior calendar grid, compute the features on the grid (every row-order op now indexes true calendar time), then restrict back to the observed rows -- a lag/difference spanning a gap is NaN and the observation fails closed; no synthetic NaN-cluster gap row reaches a regression or the reweight denominators. A gap-free panel skips the reindex (early-out) and is bit-identical. Absorbing/cluster validation runs on observed rows before reindex; treatment is absorbing-filled on the grid (exact on observed rows).
Also require integer-valued time labels (the spacing check admitted fractional 0.5, 1.5). Documented in REGISTRY (interior-gap handling + "entry = first observed treated" convention). 7 new interior-gap tests; existing 56 unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
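The difference between a row-order lag and a calendar lag is easy to see on the gap case described above (a unit observed at t=0 and t=2 only). This is a minimal pure-Python sketch with hypothetical helper names and made-up outcomes. It illustrates the failure mode, not the reindex implementation itself.

```python
import math

# One unit observed at t=0 and t=2 only (t=1 is an interior gap).
observed = {0: 10.0, 2: 14.0}


def row_order_lag(series):
    """Previous observed row, as a positional shift would return it."""
    times = sorted(series)
    lagged = {times[0]: math.nan}
    for prev, cur in zip(times, times[1:]):
        lagged[cur] = series[prev]
    return lagged


def calendar_lag(series):
    """Outcome at calendar t-1, NaN when that period is not observed."""
    return {t: series.get(t - 1, math.nan) for t in sorted(series)}


wrong = row_order_lag(observed)  # t=2 is paired with t=0's outcome
right = calendar_lag(observed)   # t=2 has no valid t-1 outcome
```

The calendar version returns NaN at t=2, which is the "fails closed" behavior: the observation drops out instead of using the wrong period as its base.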
_pmd_all_baseline subtracted panel[outcome] (the base row's own y_t), so a base
row with a missing current outcome got a NaN premean baseline and was silently
dropped at every horizon -- even though PMD's premean uses only PRIOR outcomes
and the long difference is y_{t+h} - premean (y_t is never used). A treated
entry whose entry-period outcome is missing thus lost all treatment variation,
yielding NaN coefficients. Build the numerator from fillna(0) cumulative sums
(the strictly-prior non-missing sum), independent of y_t; bit-identical when no
outcome is NaN. Regression test: pmd="max" retains the h>0 treated obs when the
entry-period outcome is missing.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
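A premean baseline that depends only on strictly-prior, non-missing outcomes can be sketched with a running sum and a running count. `premean_baseline` and the example history are hypothetical. The point of the sketch is that the current period's outcome never enters its own baseline, and NaN outcomes add to neither the numerator nor the denominator.

```python
import math


def premean_baseline(outcomes):
    """Mean of strictly-prior non-missing outcomes for each period."""
    baselines = []
    prior_sum, prior_count = 0.0, 0
    for y in outcomes:
        # Baseline is fixed before y itself is looked at.
        baselines.append(
            prior_sum / prior_count if prior_count else math.nan
        )
        if not math.isnan(y):  # NaN adds to neither sum nor count
            prior_sum += y
            prior_count += 1
    return baselines


history = [1.0, math.nan, 3.0, math.nan, 5.0]
baselines = premean_baseline(history)
```

In this example the period with a missing outcome at index 3 still gets a finite baseline (the mean of 1.0 and 3.0), so later long differences from that base row remain usable.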
- RA path (reweight=True with absorb): a treated observation whose absorbed level has no clean-control support has an all-zero control dummy, so its counterfactual would extrapolate through an unidentified coefficient. Drop those treated obs with a warning (mirrors the existing event-time identification check), never impute off a non-identified fit.
- Document the pooled pre/post estimand in REGISTRY (Note 6): the unit-equal-weighted average of each unit-event-time's mean long difference on the fixed-composition sample (every pooled target observed); equals the mean of the per-horizon event-study coefficients on a balanced panel, and differs from the authors' horizon-stacked pooled regression under cross-horizon composition changes -- exact parity reconciled in PR-B2.
- Tests: RA unsupported-absorbed-level drop; pooled-post == mean event-study.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rol label
- Reject covariates/absorb names colliding with LPDiD working columns (names starting with "_" or "horizon") before panel construction, so a user column cannot silently overwrite an internal column.
- Move the covariates/absorb string-vs-list check before the required-columns build, so absorb="region" raises the precise "must be a list" error rather than an iterate-characters missing-column error.
- Relabel the summary's "Control units" -> "Never-treated units" (the count is never-treated units; clean controls also include not-yet-treated cohorts).
- Check off the B1 pure-Python test checklist row in REGISTRY.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ol_units
- to_dict() now includes covariates and absorb (the result stores them and summary() displays them; serialized results were dropping the adjustment specification).
- Document that LPDiDResults.n_control_units counts never-treated units only (the library-wide field convention; the realized clean-control pool also includes not-yet-treated cohorts, whose per-horizon counts are in the table columns). summary() already labels it "Never-treated units".
- Test docstring: external R-parity "will live in" test_methodology_lpdid.py (added in PR-B2), not present in this PR.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…LS path
The RA path drops treated observations at event times with no clean control, but the default (reweight=False) path only checked global treated/control presence. A treated event time with no clean control makes its time fixed effect collinear with the treatment indicator; the rank handler could drop that time dummy and identify the effect off invalid cross-event-time comparisons, yielding a spurious finite estimate (e.g. the last-treated cohort under control_group="clean" with no never-treated units). Mirror the RA event-time identification check in _estimate_sample (covers both the default event-study and the default pooled path): drop unsupported treated obs with a warning, NaN if none remain. A heterogeneous-effect regression test proves the unidentified cohort no longer contaminates the estimate.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
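The event-time identification check described above amounts to counting clean controls per event time and dropping treated rows wherever that count is zero. The following sketch assumes a simplified row layout of `(unit, time, is_treated)` tuples for one horizon's sample. `drop_unsupported_treated` is a hypothetical name and the real check lives in `_estimate_sample` with its own data structures.

```python
from collections import defaultdict


def drop_unsupported_treated(rows):
    """Keep treated rows only at event times with a clean control.

    Returns (kept_rows, dropped_event_times).
    """
    rows = list(rows)
    controls_at = defaultdict(int)
    for _, t, is_treated in rows:
        if not is_treated:
            controls_at[t] += 1
    kept, dropped = [], set()
    for row in rows:
        _, t, is_treated = row
        if is_treated and controls_at[t] == 0:
            dropped.add(t)  # no counterfactual available at this time
        else:
            kept.append(row)
    return kept, sorted(dropped)


sample = [
    ("a", 3, True), ("c", 3, False), ("d", 3, False),
    ("b", 5, True),  # last-treated cohort: no unit is untreated at t=5
]
kept, dropped_times = drop_unsupported_treated(sample)
```

A caller would warn about `dropped_times` and report NaN when no treated row remains, matching the behavior the commit describes.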
…n docs
- summary() now lists nonzero ylags/dylags on the controls line (covariates and absorb were already shown).
- llms-full.txt: clarify SEs are unit-cluster CR1 on the default/weighted path but the influence-function cluster variance on the RA covariate path.
- llms-full.txt + api/lpdid.rst: no_composition fixes the POST-treatment composition across POST horizons (pre-placebos may vary), matching the implementation.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The RA-vs-default finite-sample-factor note said the asymmetry "is validated against the reference R packages" while the same section marks R parity as pending (PR-B2). Reword to "is documented here and will be validated ... in the R-parity follow-up" so the status is not overclaimed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Overall Assessment
Executive Summary
Methodology
Finding: RA path does not handle rank-dropped nuisance coefficients
Severity: P1
Impact: The registry documents the RA path as using clean-control regression adjustment with rank-guarded variance behavior, and the library-wide
Common triggers: constant covariate, duplicate covariate, absorbed level collinear with time FE, or lag/covariate collinearity in a small clean-control sample.
Concrete fix: After
Informational: documented deviations are mitigated
Severity: P3
Code Quality
No additional findings beyond the P1 RA rank-handling issue above.
Performance
No findings.
Maintainability
No additional findings. The LPDiD docs and guide-signature tests help reduce API drift.
Tech Debt
Finding: external R/Stata parity remains deferred
Severity: P3
Security
No findings.
Documentation/Tests
Finding: missing test for rank-deficient RA covariates
Severity: P1, tied to the Methodology finding
Path to Approval
…ficients
solve_ols sets dropped redundant-column coefficients to NaN under rank_deficient_action="warn"/"silent"; the RA path then multiplied the design by control_coef, propagating NaN through every prediction and NaN-ing an otherwise-identified ATT (triggered by a constant/duplicate covariate, a collinear absorbed level, or lag collinearity). Zero-fill the dropped coefficients before prediction/residuals -- the dropped column's effect is absorbed by the retained collinear column(s), so it acts as 0. "error" still raises inside solve_ols. Regression test: redundant duplicate + constant covariates keep a finite ATT equal to the no-redundant-column fit.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
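The NaN propagation described above, and the zero-fill remedy, can be reproduced on a single design row. This is a minimal sketch with hypothetical values. `predict` and `zero_fill` are illustrative helpers, and the coefficient vector only imitates what the commit says a rank-deficient fit reports (NaN for a dropped redundant column).

```python
import math


def predict(design_row, coef):
    """Linear prediction x'b."""
    return sum(x * b for x, b in zip(design_row, coef))


def zero_fill(coef):
    """Replace NaN (rank-dropped) coefficients with 0 before predicting."""
    return [0.0 if math.isnan(b) else b for b in coef]


# Hypothetical fit: intercept, covariate, and an exact duplicate of that
# covariate whose coefficient was dropped and reported as NaN.
control_coef = [1.0, 2.0, math.nan]
x = [1.0, 3.0, 3.0]

naive = predict(x, control_coef)               # NaN poisons the prediction
guarded = predict(x, zero_fill(control_coef))  # 1 + 2*3 + 0*3
```

Zero is the right fill value here because the retained duplicate column already carries the full effect, so the prediction matches the fit without the redundant column.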
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
✅ Looks good: no unmitigated P0/P1 findings in the changed diff.
Executive Summary
Methodology
Finding: Previous RA rank-deficiency issue resolved
Severity: P3 informational
Finding: Documented deviations are mitigated
Severity: P3 informational
Code Quality
Finding: None
Severity: N/A
Performance
Finding: None
Severity: N/A
Maintainability
Finding: None
Severity: N/A
Tech Debt
Finding: External R/Stata parity remains tracked
Severity: P3 informational
Security
Finding: None
Severity: N/A
Documentation/Tests
Finding: Local test execution unavailable
Severity: P3 informational
Summary
- `LPDiD` (Local Projections Difference-in-Differences; Dube, Girardi, Jordà & Taylor 2025), the absorbing-treatment path: a separate per-horizon OLS of the long difference `y_{i,t+h} − y_{i,t−1}` on the treatment-switch indicator with calendar-time fixed effects (no unit FE), restricted to a clean-control sample (newly-treated + not-yet-treated), so the default variance-weighted estimand has strictly non-negative weights. Options: `reweight=True` (equally-weighted ATT, numerically equivalent to Callaway-Sant'Anna), `pmd` (premean-differenced base period), `no_composition` (fixed post-treatment composition), pooled pre/post estimands, `ylags`/`dylags` lag controls, and a regression-adjustment covariate path.
- `LPDiDResults` (`summary()`/`to_dict()`/`to_dataframe()`/headline aliases/cluster metadata), all inference routed through `solve_ols`/`_rank_guarded_inv`/`safe_inference`, registered in `__init__`, and full doc ceremony (REGISTRY `## LPDiD`, references, `docs/api/lpdid.rst`, `llms.txt`/`llms-full.txt`, README, choosing-estimator, doc-deps, CHANGELOG).
- `control_group="clean"`) instead of identifying off a collinear time-FE drop. `pmd="max"` premean depends only on prior outcomes (never the base row's own `y_t`); value-based `no_composition` fixed composition under any missingness encoding; RA absorbed-factor overlap check; reserved-name / integer-time / missing-cluster guards; influence-function cluster-variance labeling for the RA path.
Methodology references
`REGISTRY.md` `## LPDiD` → Deviations): cluster-robust SE at unit level with a `t(G−1)` reference (the paper specifies no SE; matches Stata `lpdid` `vce(cluster unit)`); RA-path influence-function cluster variance (ImputationDiD/BJS family, no finite-sample factor), reconciled in the R-parity follow-up; absorbing-treatment scope only; fixed-composition pooled estimand (vs the authors' horizon-stacked pooled regression), reconciled in the R-parity follow-up; interior-gap reindex + "entry = first observed treated" convention.
Validation
`tests/test_lpdid.py` (69 tests: API/validation, analytical closed-form, DGP recovery, cross-estimator equivalences [`reweight` == Callaway-Sant'Anna exact to 1e-15; `pmd="max"` single-cohort == BJS/ImputationDiD], and unbalanced / interior-gap / clean-control-support / RA-overlap / pmd-missing edge cases), plus `tests/test_guides.py` llms-full signature coverage and the doc-deps/discoverability ceremony suites. External R-package parity (authors' `danielegirardi/lpdid` + cross-checked `alexCardazzi/lpdid`) is the scope of a tracked follow-up (PR-B2).
Security / privacy
🤖 Generated with Claude Code