Predictor — learning loop
Aratea is a weather-factor discovery engine. Every named feature here is a hypothesis; every training run measures whether it carries signal. The benchmark is the kalshi_mid Brier score computed on the same row set: the model has to beat the market on its own ground.
Everything the manifest carries: named factors with their leave-one-out delta, paper-trade ledger, training runs and Brier trajectory. This is the meteorologist / actuary view — no rounding, no sugar-coating.
Manifest generated at 2026-05-19T19:54:15Z (schema v3).
Hybrid effective sample (N_eff)
α = 0.3
N_eff drives secondary decisions only: feature-set selection, reliability plots, and the complementary promotion check. The Phase 1 go/no-go gate stays strictly on N_live; backtest volume never substitutes for live trades there.
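As a rough illustration of how the two counts might interact, the sketch below assumes N_eff = N_live + α · N_backtest_strict with α = 0.3. That formula is an assumption made for this example (the authoritative definition lives in CONVENTION §6.bis), and the function names and the `required_live` parameter are hypothetical.

```python
ALPHA = 0.3  # backtest discount quoted above


def n_eff(n_live: int, n_backtest_strict: int, alpha: float = ALPHA) -> float:
    """Assumed form: live trades count fully, strict backtest rows at weight alpha."""
    if n_live < 0 or n_backtest_strict < 0:
        raise ValueError("counts must be non-negative")
    return n_live + alpha * n_backtest_strict


def phase1_gate(n_live: int, required_live: int) -> bool:
    """Go/no-go looks at live trades only; backtest volume is ignored on purpose.

    required_live is a hypothetical parameter; the real threshold is not stated here.
    """
    return n_live >= required_live
```

Keeping the gate as a separate function that never sees the backtest count makes it hard to let backtest volume leak into the Phase 1 decision by accident.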
Read CONVENTION §6.bis
A. Live runs (Kalshi paper trades)
Each row is a real paper trade on Kalshi. The champion takes the position (real ledger row, real P&L); challengers and baselines run in shadow mode for Brier comparison. B is each model's Brier score on the resolved trade, and ★ marks the best (lowest) Brier on a given run. The promotion rule (champion swap) requires rolling-mean Brier dominance over N≥10 resolved trades; single-run wins are anecdotal.
| Run | When | Event / Bin | Side | Champion p | Challenger p | Baseline p | kalshi_mid | Outcome | P&L paper |
|---|---|---|---|---|---|---|---|---|---|
| 010 | 2026-05-18 | LOWTLAX 19/5 B57.5 | NO | 25.6% | 14.4% | 45.5% | 45.5% | PENDING | — |
| 009 | 2026-05-18 | LOWTNYC 19/5 B69.5 | NO | 7.6% | 17.2% | 14.5% | 14.5% | PENDING | — |
| 008 | 2026-05-18 | LOWTNYC 19/5 B71.5 | NO | 6.2% | 17.0% | 40.0% | 40.0% | PENDING | — |
| 007 | 2026-05-16 | LOWTNYC 17/5 B64.5 | NO | 8.4% (B=0.0070) ★ | 16.0% (B=0.0255) | 35.5% (B=0.1260) | 35.5% | WIN (NO) | +$55.02 |
| 006 | 2026-05-15 | LOWTNYC 16/5 B53.5 | NO | 16.8% (B=0.0284) ★ | 21.9% (B=0.0478) | 39.5% (B=0.1560) | 39.5% | WIN (NO) | +$65.17 |
| 005 | 2026-05-14 | LOWTNYC 15/5 B49.5 | NO | 14.9% (B=0.7240) | 18.6% (B=0.6626) | 34.5% (B=0.4290) ★ | 34.5% | LOSS (YES) | −$99.56 |
| 004 | 2026-05-13 | LOWTNYC 14/5 B52.5 | NO | 14.1% (B=0.7385) | 17.1% (B=0.6873) | 32.0% (B=0.4624) ★ | 32.0% | LOSS (YES) | −$99.96 |
| 003 | 2026-05-12 | LOWTNYC 13/5 B51.5 | NO | 14.0% (B=0.0196) ★ | 18.0% (B=0.0326) | 34.5% (B=0.1190) | 34.5% | WIN (NO) | +$52.44 |
| 002 | 2026-05-10 | LOWTNYC 11/5 B50.5 | NO | 14.6% (B=0.0213) ★ | — | 15.4% (B=0.0237) | 36.0% | WIN (NO) | +$56.16 |
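
The per-trade Brier values and the promotion rule can be sketched as follows. The Brier score of a single binary forecast is the squared error between P(YES) and the 0/1 outcome. The promotion check is one possible reading of "rolling-mean dominance over N≥10 resolved trades" (mean over the last N trades, strictly lower wins); the function names are hypothetical, and the example lists are the champion and challenger B values of runs 003 to 007 from the table.

```python
from statistics import mean
from typing import Sequence

MIN_RESOLVED = 10  # N >= 10 resolved trades, from the promotion rule


def brier(p_yes: float, outcome_yes: bool) -> float:
    """Squared error between the forecast P(YES) and the realised 0/1 outcome."""
    return (p_yes - (1.0 if outcome_yes else 0.0)) ** 2


def challenger_dominates(champion_b: Sequence[float],
                         challenger_b: Sequence[float],
                         window: int = MIN_RESOLVED) -> bool:
    """Assumed rule: challenger's mean Brier over the last `window` trades is lower."""
    if len(champion_b) != len(challenger_b):
        raise ValueError("both models must be scored on the same trades")
    if len(champion_b) < window:
        return False  # not enough resolved trades yet
    return mean(challenger_b[-window:]) < mean(champion_b[-window:])


# B values for runs 003..007 (oldest first), copied from the table.
champion_b = [0.0196, 0.7385, 0.7240, 0.0284, 0.0070]
challenger_b = [0.0326, 0.6873, 0.6626, 0.0478, 0.0255]
```

On these five trades the challenger's mean Brier (about 0.291) sits below the champion's (about 0.304), yet `challenger_dominates(champion_b, challenger_b)` returns `False` because fewer than ten trades have resolved. Note that `brier(0.084, False)` gives 0.007056, which differs from the table's 0.0070 only because the displayed probability is rounded.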
B. Named factors
Each row is a named hypothesis used by the learned predictor at training time. Brier Δ is the leave-one-out test delta from the most recent training run — sort by it to see what carried the model.
| Name | Hypothesis | Source | Added | Brier Δ ↓ | Status |
|---|---|---|---|---|---|
| p_ensemble | Mean of four vendor probabilities (ECMWF + GraphCast + GFS + JMA). Hypothesis: vendor disagreement washes out, the mean is the wisest single bet. (Bench 2026-05-11 N=138: ensemble Brier 0.1429 vs kalshi_mid 0.0845 — the average **lost** to the market, so we need to learn weights instead of averaging blindly.) | derived from `predictors/ensemble.py` | 2026-05-09 | ↑ +0.0041 | active |
| p_climatology | Historical base rate of (variable in [lower, upper]) over the same date-of-year window from the past 15 years. The dumb-but-honest prior every forecast must beat. | derived from `predictors/climatology.py` (Open-Meteo historical) | 2026-05-09 | ↑ +0.0015 | experimental |
| forecast_spread | Max − min of the per-vendor probabilities (proxy of model disagreement). Hypothesis: when vendors disagree, the prediction is less trustworthy and the market mid carries more weight than the model. | derived from `predictions.ensemble.inputs.individual_probs` | 2026-05-09 | ↑ +0.0008 | active |
| urban_density_5km | OSM `way["building"]` count within 5 km of the station. Hypothesis: urban heat island raises overnight lows above what a non-urban climatology predicts → biases low-temp markets in cities. Units: building count (not %-area; see README for why). | OSM Overpass API | 2026-05-11 | ↑ +0.0000 | experimental |
| elevation_m | USGS EPQS elevation at the station point. Hypothesis: thinner air at altitude amplifies the diurnal swing (Denver KDEN ~1638 m vs. Miami KMIA ~2 m at the extremes of our station set). | https://epqs.nationalmap.gov/v1/json | 2026-05-11 | ↑ +0.0000 | experimental |
| latitude | Station latitude (degrees, signed). Hypothesis: insolation, daylight length, and seasonal amplitude scale with `cos(latitude)` — explicit feature lets the learner discover the interaction with the date-of-year encoded in climatology. | NWS_STATIONS table | 2026-05-11 | ↑ +0.0000 | experimental |
| forest_pct_5km | OSM `natural=wood` + `landuse=forest` feature count within 5 km. Hypothesis: canopy cover lowers daytime highs (shade + evapotranspiration) and limits radiative night cooling (canopy traps). Units: feature count. | OSM Overpass API | 2026-05-11 | ↑ +0.0000 | experimental |
| water_pct_10km | OSM `natural=water` + `waterway=*` feature count within 10 km. Hypothesis: large water bodies dampen diurnal swings via thermal inertia → tightens the [lower, upper] hit probability for both highs and lows. Units: feature count (kept the `_pct_` name from the spec for continuity). | OSM Overpass API | 2026-05-11 | ↓ −0.0000 | experimental |
| distance_to_coast_km | Haversine distance to the nearest Natural Earth 1:50m coastline vertex. Hypothesis: maritime regime (Boston, Miami, SFO) damps extremes; continental regime (Denver, Oklahoma City) amplifies them. | Natural Earth `ne_50m_coastline.geojson` | 2026-05-11 | ↓ −0.0000 | experimental |
| days_ahead | Days between snapshot and target_date. Hypothesis: forecast skill decays with horizon, learned weights should interact non-linearly with this. | derived from `predictions.forecast_blend.inputs.days_ahead` | 2026-05-09 | ↓ −0.0000 | experimental |
| p_forecast_blend | Open-Meteo deterministic forecast around target_date, blended with climatology by horizon. Hypothesis: a state-of-the-art deterministic forecast carries calibrated short-horizon signal. | derived from `predictors/forecast_blend.py` | 2026-05-09 | ↓ −0.0021 | active |
| p_nws_ndfd | P(YES) computed from the NWS NDFD official forecast, gaussian around NDFD temp with sigma from climatology range. Hypothesis: the agency that *resolves* Kalshi weather markets (NWS Climatological Report Daily) should issue the highest-signal forecast available. | https://api.weather.gov | 2026-05-11 | TBD (forward-only — no historical coverage yet) | experimental |
Brier Δ is the leave-one-out test-Brier delta from the latest run: negative (↓) = feature carried signal, positive (↑) = net noise on this split.
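
A minimal sketch of how a leave-one-out delta with this sign convention could be computed. It assumes Δ = Brier(all features) − Brier(all features except one), so that a negative value means the feature lowered the test Brier. The `test_brier` callable is a hypothetical stand-in for "fit on the training split with this feature subset and return the test Brier"; it is not the project's actual training code.

```python
from typing import Callable, Dict, List, Sequence


def leave_one_out_deltas(
    features: Sequence[str],
    test_brier: Callable[[List[str]], float],
) -> Dict[str, float]:
    """Return {feature: Brier(full set) - Brier(full set without that feature)}."""
    full_score = test_brier(list(features))
    deltas: Dict[str, float] = {}
    for name in features:
        reduced = [f for f in features if f != name]
        # Negative delta: removing the feature makes the Brier worse,
        # i.e. the feature carried signal on this split.
        deltas[name] = full_score - test_brier(reduced)
    return deltas
```

Each feature costs one extra fit, so the whole table needs one fit on the full set plus one per feature. A delta near zero, as reported for most of the geographic features above, means the refit barely moved.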
C. Latest training run
Snapshot of the most recent sklearn fit of the learned predictor on historical resolutions. This is not a paper trade — it's a cross-validation pass to see whether the current feature set has edge over kalshi_mid on past Kalshi events.
Phase A.2 rerun under clean target_date split; supersedes invalidated 20260514T141925Z. Decision-gate run for the temporal-split fix.
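
One plausible reading of a "clean target_date split" is that no target_date appears in both train and test, with the newest dates held out. The sketch below illustrates that idea only; the row layout (dicts with a `target_date` key), the function name, and the `test_fraction` parameter are assumptions, not the project's actual splitter.

```python
from datetime import date
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def split_by_target_date(rows: Sequence[Row],
                         test_fraction: float = 0.3) -> Tuple[List[Row], List[Row]]:
    """Hold out the newest target dates; a date never straddles the boundary."""
    dates = sorted({r["target_date"] for r in rows})
    if len(dates) < 2:
        raise ValueError("need at least two distinct target dates to split")
    # Choose how many distinct dates go to test, leaving at least one for train.
    n_test_dates = max(1, round(len(dates) * test_fraction))
    n_test_dates = min(n_test_dates, len(dates) - 1)
    cutoff = dates[-n_test_dates]
    train = [r for r in rows if r["target_date"] < cutoff]
    test = [r for r in rows if r["target_date"] >= cutoff]
    return train, test
```

Cutting on distinct dates rather than on row counts means the realised test share can drift from `test_fraction` when some dates carry many more rows than others.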
D. Training run history
Every sklearn training pass, most recent first. This is not the paper-trade history — see section A for that. A training run with Brier test below Brier kalshi_mid on the same rows means the model has signal beyond the market mid in cross-validation.
| When (UTC) | Feature set | n_test | Brier test | Brier kalshi_mid | Gap | Verdict | Notes |
|---|---|---|---|---|---|---|---|
| 2026-05-14 19:19 UTC | v2 | 84 | 0.1323 | 0.1305 | +0.0018 | MARKET WINS | Phase A.2 rerun under clean target_date split; supersedes invalidated 20260514T141925Z. Decision-gate run for the temporal-split fix. |
| 2026-05-12 13:45 UTC | v2 | 65 | 0.1301 | 0.0764 | +0.0537 | MARKET WINS | schema v2: add intercept + feature_means/stds for live inference |
| 2026-05-11 13:08 UTC | v2 | 42 | 0.1260 | 0.0282 | +0.0978 | MARKET WINS | First A.3 discovery run. V2 = V0 baseline + 6 static geographic features (urban_density_5km, water_pct_10km, forest_pct_5km, elevation_m, distance_to_coast_km, latitude). N=138 resolved, train=96 (older), test=42 (newer). Test slice currently collapses to a single capture date (20260510T171217Z) — limits temporal variance; geographic deltas mostly null on this split as expected. |
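
The Gap and Verdict columns follow directly from the two Brier columns. The sketch below reproduces them for the three rows above; the "MODEL WINS" label and the tie handling are assumptions, since the table only ever shows "MARKET WINS".

```python
from typing import Tuple


def gap_and_verdict(brier_test: float, brier_kalshi_mid: float) -> Tuple[float, str]:
    """Gap = model test Brier minus market Brier on the same rows (lower is better)."""
    gap = round(brier_test - brier_kalshi_mid, 4)
    # The model must be strictly below the market to claim signal; ties go to the market.
    verdict = "MODEL WINS" if gap < 0 else "MARKET WINS"  # "MODEL WINS" is a hypothetical label
    return gap, verdict


# (Brier test, Brier kalshi_mid) for the three runs in the table, most recent first.
RUNS = [(0.1323, 0.1305), (0.1301, 0.0764), (0.1260, 0.0282)]

if __name__ == "__main__":
    for model_b, market_b in RUNS:
        print(gap_and_verdict(model_b, market_b))
```

Applied to the table, this gives gaps of +0.0018, +0.0537, and +0.0978, matching the Gap column.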
E. Training Brier trajectory
Learned model (test) vs kalshi_mid (same test rows) across all training runs. Dashed horizontal line is the most recent kalshi_mid Brier as all-time reference; vertical dashed markers flag a feature-set bump (v0 → v1 → v2 …).
F. Backtest replays
Replayed records produced by backtest.py against settled Kalshi events. Only strict point-in-time records count toward N_backtest_strict; NAIVE-mode rows are flagged and excluded from the hybrid sample.