forked from igerber/diff-diff
-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathchaisemartin_dhaultfoeuille_bootstrap.py
More file actions
1175 lines (1088 loc) · 56 KB
/
Copy pathchaisemartin_dhaultfoeuille_bootstrap.py
File metadata and controls
1175 lines (1088 loc) · 56 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
"""
Multiplier-bootstrap inference for the de Chaisemartin-D'Haultfoeuille (dCDH)
estimator.
The dCDH papers prescribe only the analytical cohort-recentered plug-in
variance from Web Appendix Section 3.7.3 of the dynamic companion paper.
This module adds an opt-in multiplier bootstrap, clustered at the group
level by default (matching the inference convention used by
``CallawaySantAnna``, ``ImputationDiD``, and ``TwoStageDiD``). Under
``survey_design`` with an explicitly-coarser PSU, the bootstrap switches
to PSU-level Hall-Mammen wild clustering: each PSU draws a single
multiplier and all groups within that PSU share it (see
``_generate_psu_or_group_weights`` and ``_map_for_target`` below, plus
the REGISTRY.md ``ChaisemartinDHaultfoeuille`` Note on survey +
bootstrap). Under the default auto-inject ``psu=group`` each group is
its own PSU and the identity-map fast path reproduces the original
group-level behavior bit-for-bit.
**Cell-level wild PSU bootstrap (within-group-varying PSU):** when a
survey design's PSU varies across the cells of a group, the group-level
PSU map cannot represent per-PSU contributions (a group with cells in
PSUs ``{p1, p2}`` would collapse to the first PSU, under-clustering the
bootstrap). This module's dispatcher (``_psu_varies_within_group``)
detects the regime and switches to a cell-level allocator: each
observation-level ``psi_i`` is attributed to its ``(g, t)`` cell's PSU
via ``psu_codes_per_cell`` (shape ``(n_eligible_groups, n_periods)``,
-1 sentinel for zero-weight cells), and the bootstrap statistic becomes
``theta_r = sum_c multiplier[psu(c)] * u_centered_pp[c] / divisor``
(using the cohort-recentered per-cell IF ``U_centered_per_period``).
Under PSU-within-group-constant the row-sum identity
``sum_{c in g} u_cell[c] == u_centered[g]`` (enforced by
``_cohort_recenter_per_period``) makes the cell-level and group-level
bootstraps statistically equivalent, and the dispatcher routes to the
legacy group-level path for bit-identity with pre-cell-level releases.
Multi-horizon bootstraps draw a single shared ``(n_bootstrap, n_psu)``
PSU-level weight matrix per block and broadcast per-horizon via each
horizon's cell-to-PSU map, so the sup-t simultaneous band remains a
valid joint distribution. The bootstrap is a library extension, not a
paper requirement, and is documented in ``REGISTRY.md``.
The mixin operates on **pre-computed cohort-centered influence-function
values**: the main estimator class computes per-group ``U^G_g`` values
(and, under survey designs, per-``(g, t)``-cell attributions
``U_centered_per_period[g, t]``) during the analytical variance
calculation, recenters them by their cohort means (using the
``(D_{g,1}, F_g, S_g)`` triple), and stores the recentered vector /
tensor. The bootstrap then multiplies this structure by random
multiplier weights (Rademacher / Mammen / Webb) and re-aggregates to
produce a bootstrap distribution per target.
"""
from typing import TYPE_CHECKING, Any, Dict, Optional, Tuple
import numpy as np
from diff_diff.bootstrap_utils import (
compute_effect_bootstrap_stats as _compute_effect_bootstrap_stats,
)
from diff_diff.bootstrap_utils import (
generate_bootstrap_weights_batch as _generate_bootstrap_weights_batch,
)
from diff_diff.chaisemartin_dhaultfoeuille_results import DCDHBootstrapResults
__all__ = ["ChaisemartinDHaultfoeuilleBootstrapMixin"]
class ChaisemartinDHaultfoeuilleBootstrapMixin:
"""
Bootstrap-inference mixin for ``ChaisemartinDHaultfoeuille``.
Provides a single entry point ``_compute_dcdh_bootstrap`` that takes
pre-computed centered influence-function values for each estimand
target (overall ``DID_M``, joiners ``DID_+``, leavers ``DID_-``,
placebo ``DID_M^pl``) and returns a populated
:class:`DCDHBootstrapResults`.
The mixin is pure (no instance state of its own); it only references
instance attributes from the main class via ``TYPE_CHECKING`` hints.
"""
# --- Type hints for attributes accessed from the main class ---
n_bootstrap: int
bootstrap_weights: str
alpha: float
seed: Optional[int]
if TYPE_CHECKING: # pragma: no cover
def _placeholder(self) -> None: ... # silences mypy "no attributes" warnings
def _compute_dcdh_bootstrap(
self,
n_groups_for_overall: int,
u_centered_overall: np.ndarray,
divisor_overall: int,
original_overall: float,
joiners_inputs: Optional[Tuple[np.ndarray, int, float, Optional[np.ndarray]]] = None,
leavers_inputs: Optional[Tuple[np.ndarray, int, float, Optional[np.ndarray]]] = None,
placebo_inputs: Optional[Tuple[np.ndarray, int, float, Optional[np.ndarray]]] = None,
# --- Phase 2: multi-horizon inputs ---
multi_horizon_inputs: Optional[
Dict[int, Tuple[np.ndarray, int, float, Optional[np.ndarray]]]
] = None,
placebo_horizon_inputs: Optional[
Dict[int, Tuple[np.ndarray, int, float, Optional[np.ndarray]]]
] = None,
# --- Phase 3: per-path (by_path) bootstrap inputs ---
# Nested dict keyed by path tuple -> horizon -> 4-tuple of
# (u_centered_path, n_l_path, effect_path, u_per_period_optional).
# The 4th slot is always None in the current release (survey +
# by_path + bootstrap combination is gated out in fit(); cell-
# level wild PSU bootstrap for per-path targets is a future
# wave item).
path_bootstrap_inputs: Optional[
Dict[
Tuple[int, ...],
Dict[int, Tuple[np.ndarray, int, float, Optional[np.ndarray]]],
]
] = None,
# --- Phase 3: per-path placebo (by_path + placebo) bootstrap inputs ---
# Nested dict keyed by path tuple -> positive lag l (l = 1..L_max)
# -> 4-tuple `(u_centered_pl_path, n_pl_l_path, effect_pl_path,
# None)`. Sibling of `path_bootstrap_inputs` for backward
# placebo horizons; same reason for the 4th-slot `None`
# (survey + by_path + placebo + bootstrap is gated out, no
# cell-level path needed).
path_placebo_bootstrap_inputs: Optional[
Dict[
Tuple[int, ...],
Dict[int, Tuple[np.ndarray, int, float, Optional[np.ndarray]]],
]
] = None,
# --- Survey: PSU-level bootstrap under survey designs ---
group_id_to_psu_code: Optional[Dict[Any, int]] = None,
eligible_group_ids: Optional[np.ndarray] = None,
# --- Survey: cell-level wild PSU bootstrap ---
u_per_period_overall: Optional[np.ndarray] = None,
psu_codes_per_cell: Optional[np.ndarray] = None,
) -> DCDHBootstrapResults:
"""
Compute multiplier-bootstrap inference for all dCDH targets.
Each target ``T`` is summarized by:
- a centered influence-function vector of length equal to the
number of groups contributing to ``T``
- a re-aggregation **divisor**, which is the *switching-cell*
count from the Theorem 3 weighting formula (NOT a group
count). For ``DID_M`` the divisor is ``N_S = sum_t (N_{1,0,t}
+ N_{0,1,t})``; for ``DID_+`` it is ``sum_t N_{1,0,t}``; for
``DID_-`` it is ``sum_t N_{0,1,t}``. See REGISTRY.md
``ChaisemartinDHaultfoeuille`` for the cell-count weighting
contract.
- the original point estimate of ``T`` (used as the centering
point for the percentile p-value)
For each target, this method dispatches between two
allocator paths based on ``psu_codes_per_cell`` (via the
``_psu_varies_within_group`` helper):
**Legacy group-level path** — runs when
``psu_codes_per_cell`` is ``None`` or PSU is within-group-
constant (PSU=group auto-inject, strictly-coarser PSU with
within-group constancy, or no survey design). For each target:
1. Generates an ``(n_bootstrap, n_groups_target)`` matrix of
multiplier weights via
:func:`~diff_diff.bootstrap_utils.generate_bootstrap_weights_batch`,
where ``n_groups_target`` is the IF vector length (one
weight per contributing group). When a coarser PSU map is
passed, weights are drawn once per PSU and broadcast to
groups so all groups in the same PSU share a multiplier.
2. Computes the bootstrap distribution as
``W @ u_centered / divisor`` (one bootstrap replicate per
row), where ``divisor`` is the switching-cell count
described above. Note: the weight matrix has one column per
contributing group, but the divisor is a cell count — the
two are different quantities (groups can contribute to
multiple cells across periods).
3. Passes the distribution + the original point estimate through
:func:`~diff_diff.bootstrap_utils.compute_effect_bootstrap_stats`
to obtain ``(SE, CI, p_value)``.
**Cell-level wild PSU path** — runs when PSU varies within
group (any row of ``psu_codes_per_cell`` has > 1 unique non-
sentinel code). For each target:
1. Unrolls the target's ``u_per_period`` tensor and
``psu_codes_per_cell`` to 1-D cell-level arrays via
:func:`_unroll_target_to_cells`, dropping cells with
sentinel (-1) PSU. Raises ``ValueError`` if cohort-
recentered mass lands on a sentinel cell (the terminal-
missingness guard — see REGISTRY.md).
2. For **single-target** branches (overall, joiners, leavers,
placebo-horizon-per-l), generates an
``(n_bootstrap, n_cells_in_target)`` weight matrix where
each cell gets the multiplier of its PSU code, via
``psu_weights[:, psu_cell]`` broadcast. Bootstrap
distribution becomes ``W_cell @ u_cell / divisor``.
3. For the **multi-horizon** block, draws a SINGLE shared
``(n_bootstrap, n_psu)`` PSU-level weight matrix once and
broadcasts per-horizon via each horizon's cell-to-PSU map.
Preserves the sup-t simultaneous confidence band as a
valid joint distribution across horizons.
Under PSU=group, the row-sum identity
``sum_{c in g} u_cell[c] == u_centered[g]`` makes the two
paths statistically equivalent; the dispatcher prefers the
legacy path in that regime for bit-identity with pre-PR-4
releases via the identity-map fast path.
Parameters
----------
n_groups_for_overall : int
Number of groups contributing to the overall ``DID_M``
(length of ``u_centered_overall``). Used for shape
validation and weight-matrix sizing.
u_centered_overall : np.ndarray
Cohort-centered per-group influence-function values for
``DID_M``. Shape: ``(n_groups_for_overall,)``.
divisor_overall : int
Re-aggregation **divisor** for ``DID_M`` — the switching-
cell count ``N_S = sum_t (N_{1,0,t} + N_{0,1,t})`` from
Theorem 3 of AER 2020. NOT a group count. For Phase 1
this is the same value used in the analytical SE plug-in.
original_overall : float
The original point estimate of ``DID_M``. Used by
:func:`compute_effect_bootstrap_stats` for the percentile
p-value computation.
joiners_inputs : tuple, optional
``(u_centered, divisor, original_effect, u_per_period)``
4-tuple for the joiners-only ``DID_+`` target. The
``divisor`` is the joiner switching-cell total
``sum_t N_{1,0,t}`` (NOT the joiner group count).
``u_per_period`` is the cohort-recentered per-cell IF
tensor of shape ``(n_eligible_groups, n_periods)`` (or
``None`` when survey is inactive); it is consumed only
by the cell-level dispatch described below.
leavers_inputs : tuple, optional
Same 4-tuple for the leavers-only ``DID_-`` target.
``divisor`` is ``sum_t N_{0,1,t}``.
placebo_inputs : tuple, optional
Same 4-tuple for the Phase 1 per-period placebo
``DID_M^pl``. ``None`` when ``L_max=None`` (per-period
placebo has no IF). The 4th slot is always ``None`` here
because the Phase 1 placebo has no per-cell attribution.
multi_horizon_inputs : dict, optional
Per-horizon 4-tuples keyed by horizon ``l``. Same layout
as ``joiners_inputs``; the per-horizon per-cell tensor
drives the shared-PSU-weights broadcast under the cell-
level path.
placebo_horizon_inputs : dict, optional
Same shape as ``multi_horizon_inputs`` for the Phase 2
placebo-horizon targets.
group_id_to_psu_code : dict, optional
Group-ID → dense PSU code map for the legacy group-level
bootstrap path. ``None`` when no PSU info is available.
eligible_group_ids : np.ndarray, optional
Ordered group IDs matching the ``u_centered_*`` vectors
for the legacy group-level path.
u_per_period_overall : np.ndarray, optional
Cohort-recentered per-cell IF tensor for the overall
``DID_M`` target, shape ``(n_eligible_groups, n_periods)``.
Consumed only by the cell-level path.
psu_codes_per_cell : np.ndarray, optional
Per-cell PSU code tensor, shape
``(n_eligible_groups, n_periods)``, with ``-1`` sentinel
for ineligible / zero-weight cells. Used by
``_psu_varies_within_group`` to pick between the legacy
group-level bootstrap path (PSU-within-group-constant,
bit-identical to pre-PR-4 behavior via the identity-map
fast path) and the cell-level wild PSU bootstrap
(within-group-varying PSU, cell-level multipliers and
shared PSU-level weights across multi-horizon blocks).
Returns
-------
DCDHBootstrapResults
Populated bootstrap-results dataclass. Fields for unavailable
targets (joiners / leavers / placebo) are ``None``.
"""
if self.n_bootstrap <= 0:
raise ValueError(
f"_compute_dcdh_bootstrap called with n_bootstrap={self.n_bootstrap}; "
"must be > 0."
)
if u_centered_overall.ndim != 1:
raise ValueError(
"u_centered_overall must be a 1-D array of per-group influence "
f"function values, got shape {u_centered_overall.shape}"
)
if u_centered_overall.shape[0] != n_groups_for_overall:
raise ValueError(
f"u_centered_overall length ({u_centered_overall.shape[0]}) does not "
f"match n_groups_for_overall ({n_groups_for_overall})"
)
rng = np.random.default_rng(self.seed)
# PSU label for each bootstrap weight column is derived from
# the group's ID via `_map_for_target`, not from positional
# truncation. All current dCDH bootstrap targets use the
# variance-eligible group ordering (`eligible_group_ids`); if a
# future target uses a different ordering, add a dedicated
# group-IDs parameter for it rather than reusing the overall
# eligible list.
# Dispatcher for the cell-level wild PSU bootstrap. When PSU
# varies across cells of a group, a group-level PSU map
# collapses within-group PSU variation and the bootstrap
# under-clusters. The cell-level path draws multipliers at
# PSU granularity and applies them per (g, t) cell via
# `psu_codes_per_cell`. Under PSU-within-group-constant
# regimes (including PSU=group and strictly-coarser PSU
# within-group-constant), the row-sum identity
# `sum_{c in g} u_cell[c] == u_centered[g]` makes the two
# paths statistically equivalent, and the dispatcher routes
# to the legacy path for bit-identity with pre-release
# behavior. See REGISTRY.md survey + bootstrap contract Note.
psu_varies = _psu_varies_within_group(psu_codes_per_cell)
# --- Overall DID_M ---
# Skip the scalar DID_M bootstrap when divisor_overall <= 0
# (e.g., pure non-binary panels where N_S=0), but continue
# to process multi_horizon_inputs and placebo_horizon_inputs.
if divisor_overall > 0:
if psu_varies:
# Contract: when the cell-level path is required
# (psu_varies=True), the caller MUST provide the
# target's per-cell IF tensor. Silent fallback to the
# group-level allocator would under-cluster the
# bootstrap by collapsing multi-PSU group
# contributions to one PSU.
if u_per_period_overall is None:
raise ValueError(
"Cell-level bootstrap requires "
"u_per_period_overall when PSU varies "
"within group, got None. Caller must "
"thread the cohort-recentered per-cell IF "
"tensor (U_centered_pp_overall) from the "
"analytical TSL path."
)
u_boot_overall, map_boot_overall = _unroll_target_to_cells(
u_per_period_overall,
psu_codes_per_cell,
)
else:
u_boot_overall = u_centered_overall
map_boot_overall = _map_for_target(
u_centered_overall.shape[0],
group_id_to_psu_code,
eligible_group_ids,
)
overall_se, overall_ci, overall_p, overall_dist = _bootstrap_one_target(
u_centered=u_boot_overall,
divisor=divisor_overall,
original=original_overall,
n_bootstrap=self.n_bootstrap,
weight_type=self.bootstrap_weights,
alpha=self.alpha,
rng=rng,
context="dCDH overall DID_M bootstrap",
return_distribution=True,
group_to_psu_map=map_boot_overall,
)
else:
overall_se = np.nan
overall_ci = (np.nan, np.nan)
overall_p = np.nan
overall_dist = None
results = DCDHBootstrapResults(
n_bootstrap=self.n_bootstrap,
weight_type=self.bootstrap_weights,
alpha=self.alpha,
overall_se=overall_se,
overall_ci=overall_ci,
overall_p_value=overall_p,
bootstrap_distribution=overall_dist,
)
# --- Joiners (DID_+) ---
if joiners_inputs is not None:
u_j, n_j, eff_j, u_pp_j = joiners_inputs
if u_j.size > 0 and n_j > 0:
if psu_varies:
if u_pp_j is None:
raise ValueError(
"Cell-level bootstrap requires joiners' "
"per-cell IF tensor (U_centered_pp_joiners) "
"when PSU varies within group, got None."
)
u_boot_j, map_boot_j = _unroll_target_to_cells(
u_pp_j,
psu_codes_per_cell,
)
else:
u_boot_j = u_j
map_boot_j = _map_for_target(
u_j.size,
group_id_to_psu_code,
eligible_group_ids,
)
se_j, ci_j, p_j, _ = _bootstrap_one_target(
u_centered=u_boot_j,
divisor=n_j,
original=eff_j,
n_bootstrap=self.n_bootstrap,
weight_type=self.bootstrap_weights,
alpha=self.alpha,
rng=rng,
context="dCDH joiners DID_+ bootstrap",
return_distribution=False,
group_to_psu_map=map_boot_j,
)
results.joiners_se = se_j
results.joiners_ci = ci_j
results.joiners_p_value = p_j
# --- Leavers (DID_-) ---
if leavers_inputs is not None:
u_l, n_l, eff_l, u_pp_l = leavers_inputs
if u_l.size > 0 and n_l > 0:
if psu_varies:
if u_pp_l is None:
raise ValueError(
"Cell-level bootstrap requires leavers' "
"per-cell IF tensor (U_centered_pp_leavers) "
"when PSU varies within group, got None."
)
u_boot_l, map_boot_l = _unroll_target_to_cells(
u_pp_l,
psu_codes_per_cell,
)
else:
u_boot_l = u_l
map_boot_l = _map_for_target(
u_l.size,
group_id_to_psu_code,
eligible_group_ids,
)
se_l, ci_l, p_l, _ = _bootstrap_one_target(
u_centered=u_boot_l,
divisor=n_l,
original=eff_l,
n_bootstrap=self.n_bootstrap,
weight_type=self.bootstrap_weights,
alpha=self.alpha,
rng=rng,
context="dCDH leavers DID_- bootstrap",
return_distribution=False,
group_to_psu_map=map_boot_l,
)
results.leavers_se = se_l
results.leavers_ci = ci_l
results.leavers_p_value = p_l
# --- Placebo (DID_M^pl) ---
# Phase 1 placebo has no per-cell IF (unpack tolerates None in
# the fourth slot — callers always pass None for this target).
if placebo_inputs is not None:
u_pl, n_pl, eff_pl, _u_pp_pl_unused = placebo_inputs
if u_pl.size > 0 and n_pl > 0:
se_pl, ci_pl, p_pl, _ = _bootstrap_one_target(
u_centered=u_pl,
divisor=n_pl,
original=eff_pl,
n_bootstrap=self.n_bootstrap,
weight_type=self.bootstrap_weights,
alpha=self.alpha,
rng=rng,
context="dCDH placebo DID_M^pl bootstrap",
return_distribution=False,
group_to_psu_map=_map_for_target(
u_pl.size,
group_id_to_psu_code,
eligible_group_ids,
),
)
results.placebo_se = se_pl
results.placebo_ci = ci_pl
results.placebo_p_value = p_pl
# --- Phase 2: Multi-horizon bootstrap with shared weight matrix ---
# Generate ONE shared weight matrix so all horizons use the same
# bootstrap draw, making the sup-t statistic a valid joint
# multiplier-bootstrap band. Under PSU-within-group-constant the
# shared draws live at group granularity (bit-identical to
# pre-cell-level); under within-group-varying PSU the shared
# draws live at PSU granularity and are broadcast per-horizon to
# the horizon's cells via `psu_codes_per_cell`.
if multi_horizon_inputs is not None:
es_ses: Dict[int, float] = {}
es_cis: Dict[int, Tuple[float, float]] = {}
es_pvals: Dict[int, float] = {}
es_dists: Dict[int, np.ndarray] = {}
n_groups_mh = n_groups_for_overall
if psu_varies:
# Draw ONE shared (n_bootstrap, n_psu) PSU-level weight
# matrix. Broadcast per-horizon via each horizon's
# cell-to-PSU map inside the loop. PSU count derived
# from the dense code domain of psu_codes_per_cell.
assert psu_codes_per_cell is not None
valid_psu_codes = psu_codes_per_cell[psu_codes_per_cell >= 0]
n_psu_mh = int(valid_psu_codes.max()) + 1 if valid_psu_codes.size > 0 else 0
shared_psu_weights: Optional[np.ndarray]
if n_psu_mh > 0:
shared_psu_weights = _generate_bootstrap_weights_batch(
n_bootstrap=self.n_bootstrap,
n_units=n_psu_mh,
weight_type=self.bootstrap_weights,
rng=rng,
)
else:
shared_psu_weights = None
shared_weights = None # not used on the cell path
else:
# Shared weight matrix sized for the group set. Under
# PSU-within-group-constant (Hall-Mammen wild PSU),
# weights are drawn once per PSU and broadcast to groups
# so all groups in the same PSU share a multiplier
# within a single bootstrap replicate — preserving the
# sup-t joint distribution across horizons.
shared_weights = _generate_psu_or_group_weights(
n_bootstrap=self.n_bootstrap,
n_groups_target=n_groups_mh,
weight_type=self.bootstrap_weights,
rng=rng,
group_to_psu_map=_map_for_target(
n_groups_mh,
group_id_to_psu_code,
eligible_group_ids,
),
)
shared_psu_weights = None
for l_h, (u_h, n_h, eff_h, u_pp_h) in sorted(multi_horizon_inputs.items()):
if u_h.size > 0 and n_h > 0:
# Under the current contract every horizon's IF
# vector uses the variance-eligible group ordering
# from `eligible_group_ids`, so the shared weight
# matrix is already at the right shape. Assert
# this invariant so any future refactor that
# introduces horizon-specific masking fails loudly
# rather than silently misaligning PSU clusters via
# positional truncation.
if u_h.size != n_groups_mh:
raise ValueError(
f"Multi-horizon bootstrap: horizon {l_h} "
f"IF vector has {u_h.size} entries but "
f"shared weight matrix has {n_groups_mh} "
f"columns. dCDH's contract requires every "
f"horizon to use the variance-eligible "
f"group ordering; to support a horizon "
f"with a different ordering, thread "
f"target-specific group IDs through "
f"`multi_horizon_inputs` and project the "
f"shared PSU draws onto the horizon's own "
f"ordering via `_map_for_target`."
)
if psu_varies:
if u_pp_h is None:
raise ValueError(
f"Cell-level bootstrap requires "
f"per-cell IF tensor for multi-horizon "
f"target l={l_h} when PSU varies "
f"within group, got None."
)
assert shared_psu_weights is not None
# Cell-level: unroll this horizon's cells and
# broadcast the shared PSU weights.
u_cell_h, psu_cell_h = _unroll_target_to_cells(
u_pp_h,
psu_codes_per_cell,
)
if u_cell_h.size == 0:
continue
w_cell_h = shared_psu_weights[:, psu_cell_h]
deviations = (w_cell_h @ u_cell_h) / n_h
else:
assert shared_weights is not None
deviations = (shared_weights @ u_h) / n_h
dist_h = deviations + eff_h
se_h, ci_h, p_h = _compute_effect_bootstrap_stats(
original_effect=eff_h,
boot_dist=dist_h,
alpha=self.alpha,
)
es_ses[l_h] = se_h
es_cis[l_h] = ci_h
es_pvals[l_h] = p_h
es_dists[l_h] = dist_h
results.event_study_ses = es_ses
results.event_study_cis = es_cis
results.event_study_p_values = es_pvals
# Sup-t simultaneous confidence bands using the shared draws.
valid_horizons = [
l_h
for l_h in es_dists
if l_h in es_ses and np.isfinite(es_ses[l_h]) and es_ses[l_h] > 0
]
if len(valid_horizons) >= 2:
boot_matrix = np.array([es_dists[l_h] for l_h in valid_horizons])
effects_vec = np.array([multi_horizon_inputs[l_h][2] for l_h in valid_horizons])
ses_vec = np.array([es_ses[l_h] for l_h in valid_horizons])
t_stats = np.abs((boot_matrix - effects_vec[:, None]) / ses_vec[:, None])
sup_t_dist = np.max(t_stats, axis=0)
finite_mask = np.isfinite(sup_t_dist)
if finite_mask.sum() > 0.5 * self.n_bootstrap:
cband_crit = float(np.quantile(sup_t_dist[finite_mask], 1 - self.alpha))
results.cband_crit_value = cband_crit
# --- Phase 2: Placebo horizon bootstrap ---
# Note: placebo-horizons are treated as independent single-target
# draws (no sup-t joint-distribution requirement across placebo
# horizons), so each horizon gets its own RNG draw via
# `_bootstrap_one_target`. Under within-group-varying PSU the
# per-horizon cell unroll is used.
if placebo_horizon_inputs is not None:
pl_ses: Dict[int, float] = {}
pl_cis: Dict[int, Tuple[float, float]] = {}
pl_pvals: Dict[int, float] = {}
for l_h, (u_h, n_h, eff_h, u_pp_h) in sorted(placebo_horizon_inputs.items()):
if u_h.size > 0 and n_h > 0:
if psu_varies:
if u_pp_h is None:
raise ValueError(
f"Cell-level bootstrap requires "
f"per-cell IF tensor for placebo-"
f"horizon target l={l_h} when PSU "
f"varies within group, got None."
)
u_boot_plh, map_boot_plh = _unroll_target_to_cells(
u_pp_h,
psu_codes_per_cell,
)
if u_boot_plh.size == 0:
continue
else:
u_boot_plh = u_h
map_boot_plh = _map_for_target(
u_h.size,
group_id_to_psu_code,
eligible_group_ids,
)
se_h, ci_h, p_h, _ = _bootstrap_one_target(
u_centered=u_boot_plh,
divisor=n_h,
original=eff_h,
n_bootstrap=self.n_bootstrap,
weight_type=self.bootstrap_weights,
alpha=self.alpha,
rng=rng,
context=f"dCDH placebo l={l_h} bootstrap",
return_distribution=False,
group_to_psu_map=map_boot_plh,
)
pl_ses[l_h] = se_h
pl_cis[l_h] = ci_h
pl_pvals[l_h] = p_h
results.placebo_horizon_ses = pl_ses
results.placebo_horizon_cis = pl_cis
results.placebo_horizon_p_values = pl_pvals
# --- Phase 3: Per-path (by_path) bootstrap ---
# Treated as independent single-target draws per (path, horizon).
# No shared-weight / sup-t joint distribution across paths (or
# across horizons within a path) in this release — that's
# Wave 2 item 4 follow-up work. Each target follows the same
# `_bootstrap_one_target` dispatch as the single-horizon paths
# above; the survey cell-level dispatch path is not reachable
# here because the `by_path + survey_design` combination is
# gated out in fit() before bootstrap is invoked.
if path_bootstrap_inputs is not None:
path_ses: Dict[Tuple[int, ...], Dict[int, float]] = {}
path_cis: Dict[Tuple[int, ...], Dict[int, Tuple[float, float]]] = {}
path_pvals: Dict[Tuple[int, ...], Dict[int, float]] = {}
for path_key, horizon_inputs in path_bootstrap_inputs.items():
path_ses[path_key] = {}
path_cis[path_key] = {}
path_pvals[path_key] = {}
for l_h, (u_h, n_h, eff_h, _u_pp_h) in sorted(horizon_inputs.items()):
if u_h.size > 0 and n_h > 0:
map_boot_ph = _map_for_target(
u_h.size,
group_id_to_psu_code,
eligible_group_ids,
)
# np.errstate wrap: an identically-zero
# centered IF (degenerate path + horizon) would
# otherwise emit a RuntimeWarning: invalid
# value on the divide step in
# bootstrap_utils. The analytical pass has
# already emitted a UserWarning for the same
# (path, horizon); suppressing the
# RuntimeWarning here avoids stacked-warning
# noise without masking genuine numerical
# issues (`_bootstrap_one_target` still
# returns NaN SE for the zero-IF branch via
# the existing `se <= 0` guard in
# bootstrap_utils).
with np.errstate(invalid="ignore", divide="ignore"):
se_h, ci_h, p_h, _ = _bootstrap_one_target(
u_centered=u_h,
divisor=n_h,
original=eff_h,
n_bootstrap=self.n_bootstrap,
weight_type=self.bootstrap_weights,
alpha=self.alpha,
rng=rng,
context=f"dCDH by_path path={path_key} l={l_h} bootstrap",
return_distribution=False,
group_to_psu_map=map_boot_ph,
)
path_ses[path_key][l_h] = se_h
path_cis[path_key][l_h] = ci_h
path_pvals[path_key][l_h] = p_h
results.path_ses = path_ses
results.path_cis = path_cis
results.path_p_values = path_pvals
# --- Phase 3: Per-path placebo (by_path + placebo) bootstrap ---
# Sibling of the per-path event-study block above for backward
# placebo lags. Same independent single-target dispatch per
# (path, lag_l) via `_bootstrap_one_target`; the survey cell-
# level path is unreachable here because
# `by_path + survey_design` is gated out in fit() before
# bootstrap is invoked. The `np.errstate` wrap mirrors the
# event-study block's degenerate-IF stacked-warning suppression.
if path_placebo_bootstrap_inputs is not None:
path_pl_ses: Dict[Tuple[int, ...], Dict[int, float]] = {}
path_pl_cis: Dict[Tuple[int, ...], Dict[int, Tuple[float, float]]] = {}
path_pl_pvals: Dict[Tuple[int, ...], Dict[int, float]] = {}
for path_key, horizon_inputs in path_placebo_bootstrap_inputs.items():
path_pl_ses[path_key] = {}
path_pl_cis[path_key] = {}
path_pl_pvals[path_key] = {}
for lag_l, (u_pl, n_pl, eff_pl, _u_pp_pl) in sorted(horizon_inputs.items()):
if u_pl.size > 0 and n_pl > 0:
map_boot_pl_path = _map_for_target(
u_pl.size,
group_id_to_psu_code,
eligible_group_ids,
)
with np.errstate(invalid="ignore", divide="ignore"):
(
se_pl_h,
ci_pl_h,
p_pl_h,
_,
) = _bootstrap_one_target(
u_centered=u_pl,
divisor=n_pl,
original=eff_pl,
n_bootstrap=self.n_bootstrap,
weight_type=self.bootstrap_weights,
alpha=self.alpha,
rng=rng,
context=(
f"dCDH by_path placebo " f"path={path_key} l={lag_l} bootstrap"
),
return_distribution=False,
group_to_psu_map=map_boot_pl_path,
)
path_pl_ses[path_key][lag_l] = se_pl_h
path_pl_cis[path_key][lag_l] = ci_pl_h
path_pl_pvals[path_key][lag_l] = p_pl_h
results.path_placebo_ses = path_pl_ses
results.path_placebo_cis = path_pl_cis
results.path_placebo_p_values = path_pl_pvals
# --- Phase 3: Per-path joint sup-t (by_path + n_bootstrap > 0) ---
# Sibling of the OVERALL event-study sup-t at the multi-horizon
# block above (`:599-614`). Per-path joint simultaneous
# confidence bands across horizons 1..L_max within each path:
# one shared (n_bootstrap, n_eligible) multiplier weight matrix
# (using `self.bootstrap_weights` — Rademacher / Mammen / Webb)
# per path is broadcast across all valid horizons of that path,
# producing correlated bootstrap distributions across horizons.
# The path-specific critical value
# `c_p = quantile(max_l |t_l|, 1-alpha)` is the band half-width
# multiplier applied to each horizon's bootstrap SE in fit().
#
# Note (asymmetry vs OVERALL): this draws a FRESH shared-weights
# matrix per path AFTER the per-path SE block above has populated
# results.path_ses via independent per-(path, horizon) draws.
# Numerator: fresh shared draws; denominator: bootstrap SEs from
# the earlier independent draws. Asymptotically equivalent to
# OVERALL's self-consistent reuse, but NOT bit-identical. The
# fresh draw is intentional: it preserves RNG-state isolation
# for existing per-path SE seed-reproducibility tests.
#
# Gates: a path needs >=2 valid horizons (finite bootstrap SE>0)
# AND a strict majority (>50%) of finite sup-t draws to receive
# a band. Otherwise the path is absent from
# path_cband_crit_values (mirrors OVERALL absent-key pattern at
# `:605,612`; the strict-majority gate matches the OVERALL
# `finite_mask.sum() > 0.5 * n_bootstrap` semantics — exactly
# half finite is NOT enough).
if path_bootstrap_inputs is not None and results.path_ses:
path_cband_crits: Dict[Tuple[int, ...], float] = {}
path_cband_n_valid: Dict[Tuple[int, ...], int] = {}
for path_key, horizon_inputs in path_bootstrap_inputs.items():
bs_ses_for_path = results.path_ses.get(path_key, {})
valid_horizons = []
for l_h, (u_h, n_h, eff_h, _u_pp_h) in sorted(horizon_inputs.items()):
if u_h.size == 0 or n_h <= 0:
continue
bs_se = bs_ses_for_path.get(l_h, np.nan)
if not np.isfinite(bs_se) or bs_se <= 0:
continue
valid_horizons.append((l_h, u_h, n_h, eff_h, bs_se))
if len(valid_horizons) < 2:
continue
# All horizons within a path use the same n_eligible
# (variance-eligible group ordering enforced by
# _collect_path_bootstrap_inputs's use of
# eligible_mask_var for cohort-recentering); use the
# first valid horizon's IF size as the shared dim.
n_dim = valid_horizons[0][1].size
map_path = _map_for_target(
n_dim,
group_id_to_psu_code,
eligible_group_ids,
)
with np.errstate(invalid="ignore", divide="ignore"):
shared_weights = _generate_psu_or_group_weights(
n_bootstrap=self.n_bootstrap,
n_groups_target=n_dim,
weight_type=self.bootstrap_weights,
rng=rng,
group_to_psu_map=map_path,
)
es_dists_path = []
for _l_h, u_h, n_h, eff_h, _bs_se in valid_horizons:
deviations = (shared_weights @ u_h) / n_h
es_dists_path.append(eff_h + deviations)
boot_matrix = np.asarray(es_dists_path)
effects_vec = np.array([v[3] for v in valid_horizons])
ses_vec = np.array([v[4] for v in valid_horizons])
t_stats = np.abs((boot_matrix - effects_vec[:, None]) / ses_vec[:, None])
sup_t_dist = np.max(t_stats, axis=0)
finite_mask = np.isfinite(sup_t_dist)
if finite_mask.sum() <= 0.5 * self.n_bootstrap:
continue
crit_p = float(np.quantile(sup_t_dist[finite_mask], 1.0 - self.alpha))
if not np.isfinite(crit_p):
continue
path_cband_crits[path_key] = crit_p
path_cband_n_valid[path_key] = len(valid_horizons)
results.path_cband_crit_values = path_cband_crits
results.path_cband_n_valid_horizons = path_cband_n_valid
return results
# =============================================================================
# Internal helpers
# =============================================================================
def _psu_varies_within_group(
psu_codes_per_cell: Optional[np.ndarray],
) -> bool:
"""True when any row of ``psu_codes_per_cell`` has more than one
unique PSU label (ignoring -1 sentinel entries).
When ``False`` — including the ``None`` case for non-survey fits —
the legacy group-level bootstrap path is invoked. The row-sum
identity ``sum_{c in g} u_cell[c] == u_centered[g]`` established
by ``_cohort_recenter_per_period`` makes the cell-level and
group-level bootstraps statistically equivalent under this regime,
and the group-level path is bit-identical to pre-cell-level
releases through the existing identity-map fast path.
"""
if psu_codes_per_cell is None:
return False
for row in psu_codes_per_cell:
valid = row[row >= 0]
if valid.size > 1 and np.unique(valid).size > 1:
return True
return False
def _unroll_target_to_cells(
u_per_period_target: np.ndarray,
psu_codes_per_cell: Optional[np.ndarray],
) -> Tuple[np.ndarray, np.ndarray]:
"""Flatten a target's cohort-recentered per-cell IF tensor + its
per-cell PSU map into 1-D arrays, dropping cells with sentinel
PSU code (-1 — zero-weight cells).
Both inputs must have shape ``(n_eligible_groups, n_periods)``;
the dCDH bootstrap contract guarantees all targets share the
variance-eligible group ordering, so no per-target row subset
is needed.
Raises ``ValueError`` when any sentinel cell (-1 PSU) carries
non-zero cohort-recentered IF mass. This is a supported-edge-
case guard: under terminal missingness, ``_cohort_recenter_per_period``
subtracts column means across the full period grid, so a group
with no observation at period ``t`` can acquire non-zero centered
mass at that sentinel cell. The cell-level bootstrap cannot
allocate that mass to any PSU (the cell has no positive-weight
obs), so silently dropping it would under-weight the group's
bootstrap contribution. The analytical TSL path shares the same
cell-period allocator and fires a matching guard in
``_survey_se_from_group_if``, so both paths reject this regime
consistently. **Documented workaround (bootstrap path):**
pre-process the panel to remove terminal missingness (drop
late-exit groups or trim to a balanced sub-panel). The within-
group-varying-PSU bootstrap has no allocator fallback — unlike
Binder TSL, where using an explicit ``psu=<group_col>`` routes
the analytical path through the legacy group-level allocator;
that fallback is not available on the bootstrap side because
the cell-level wild PSU bootstrap IS the varying-PSU regime.
Returns ``(u_cell, psu_cell)`` of shape
``(n_valid_cells_in_target,)`` each.
"""
if psu_codes_per_cell is None:
raise ValueError(
"_unroll_target_to_cells requires psu_codes_per_cell; "
"caller should only invoke this on the cell-level path."
)
if u_per_period_target.shape != psu_codes_per_cell.shape:
raise ValueError(
"Cell-level bootstrap shape mismatch: target per-period "
f"IF tensor has shape {u_per_period_target.shape} but "
f"psu_codes_per_cell has shape {psu_codes_per_cell.shape}. "
"The dCDH bootstrap contract requires every target's "
"per-cell tensor to align with the (n_eligible_groups, "
"n_periods) layout of psu_codes_per_cell."
)
flat_u = u_per_period_target.ravel()
flat_psu = psu_codes_per_cell.ravel()
mask = flat_psu >= 0
# Sentinel-mass guard: reject terminal-missingness + within-group-
# varying PSU + bootstrap. The cohort-recentering column-subtraction
# at `_cohort_recenter_per_period` can leak non-zero centered mass
# onto cells with no positive-weight obs (missing-cell rows in the
# cohort still get -col_mean added when other rows contribute at
# that column). Dropping that mass silently would under-cluster the
# bootstrap in a supported panel regime.
sentinel_mass = flat_u[~mask]
if sentinel_mass.size > 0 and bool(np.any(np.abs(sentinel_mass) > 1e-12)):
raise ValueError(
"Cell-level bootstrap cannot be computed on this survey "
"panel: cohort-recentered IF mass landed on cells with "
"no positive-weight observations (psu_codes_per_cell == "
"-1). This typically occurs when terminal missingness "
"(groups observed only through some period) combines with "
"within-group-varying PSU: `_cohort_recenter_per_period` "
"subtracts column means across the full period grid, so a "
"group with no observation at period t acquires non-zero "
"centered mass there, which the cell-level allocator "
"cannot allocate to any PSU. The analytical TSL path "
"(`_survey_se_from_group_if`) fires a matching guard on "
"the same regime, so both paths reject this panel "
"consistently. Pre-process the panel to remove terminal "
"missingness (drop late-exit groups or trim to a balanced "
"sub-panel). The within-group-varying-PSU bootstrap has "
"no allocator fallback — unlike Binder TSL, switching to "
"`psu=<group_col>` does not help here because the varying-"
"PSU bootstrap IS the cell-level path, not an analytical "
"surface with a legacy-allocator alternative."
)
return flat_u[mask], flat_psu[mask].astype(np.int64, copy=False)
def _map_for_target(
target_size: int,
group_id_to_psu_code: Optional[Dict[Any, int]],
eligible_group_ids: Optional[np.ndarray],
) -> Optional[np.ndarray]:
"""Build a PSU map for a bootstrap target from IDs (not positions).
The caller passes:
- ``group_id_to_psu_code``: a dict mapping each variance-eligible
group ID to its dense PSU code (built once in ``fit()``).
- ``eligible_group_ids``: the ordered list of group IDs that
correspond to the current target's ``u_centered`` vector.
Returns an integer array of length ``target_size`` where entry
``i`` is the PSU code for the ``i``-th contributing group.