aratea.dashboard

Predictor — learning loop

Aratea is a weather-factor discovery engine. Every named feature here is a hypothesis; every training run measures whether it carries signal. The benchmark is the kalshi_mid Brier score computed on the same rows: the model has to beat the market on its own ground.

Everything the manifest carries: named factors with their leave-one-out delta, paper-trade ledger, training runs and Brier trajectory. This is the meteorologist / actuary view — no rounding, no sugar-coating.

Features tracked: 12
Active: 3
Experimental: 9
Dropped: 0
Paper bets (open / resolved): 3 / 6
Phase 1: 9/50

Manifest generated at 2026-05-19T19:54:15Z (schema v3).

Hybrid effective sample (N_eff)

α = 0.3
N_eff = N_live + α × N_backtest_strict = 6 + 0.3 × 268 = 86.4

N_live (real paper trades): 6
N_backtest_strict (replay, point-in-time): 268
NAIVE-excluded (informational): 0
Phase 1 gate (strict, live only): N_live ≥ 50. Currently 6/50.

N_eff drives secondary decisions only — feature-set selection, reliability plots, complementary promotion check. The Phase 1 go/no-go gate stays strictly on N_live; backtest volume never substitutes for live trades there.
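
The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the dashboard's actual code: the `SampleCounts` class and both function names are hypothetical, and only the formula, α = 0.3, and the 50-trade threshold come from this page.

```python
from dataclasses import dataclass

ALPHA = 0.3            # backtest discount shown on this page
PHASE1_MIN_LIVE = 50   # Phase 1 gate threshold (live trades only)


@dataclass(frozen=True)
class SampleCounts:
    """Hypothetical container for the three counts shown on the dashboard."""
    n_live: int                # real paper trades
    n_backtest_strict: int     # point-in-time replay records
    n_naive_excluded: int = 0  # informational only, never counted


def effective_sample(counts: SampleCounts, alpha: float = ALPHA) -> float:
    """N_eff = N_live + alpha * N_backtest_strict."""
    return counts.n_live + alpha * counts.n_backtest_strict


def phase1_gate_open(counts: SampleCounts, min_live: int = PHASE1_MIN_LIVE) -> bool:
    """The gate looks at live trades only; backtest volume is ignored."""
    return counts.n_live >= min_live


counts = SampleCounts(n_live=6, n_backtest_strict=268)
print(round(effective_sample(counts), 1))  # 86.4
print(phase1_gate_open(counts))            # False
```

Keeping the gate as a separate function that never reads `n_backtest_strict` makes it hard to let replay volume leak into the go/no-go decision by accident.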

Read CONVENTION §6.bis
Filters: Series · Status

A. Live runs (Kalshi paper trades)

Each row is a real paper trade on Kalshi. The champion takes the position (real ledger row, real P&L); challengers and baselines run in shadow mode for Brier comparison. ★ marks the best Brier on a given run. The promotion rule (champion swap) needs a rolling-mean Brier dominance over N≥10 resolved trades — single-run wins are anecdotal.

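| Run | When | Event / Bin | Side | Champion p | Challenger p | Baseline p | kalshi_mid | Outcome | P&L (paper) |
|---|---|---|---|---|---|---|---|---|---|
| 010 | 2026-05-18 | LOWTLAX 19/5 B57.5 | NO | 25.6% | 14.4% | 45.5% | 45.5% | PENDING | |
| 009 | 2026-05-18 | LOWTNYC 19/5 B69.5 | NO | 7.6% | 17.2% | 14.5% | 14.5% | PENDING | |
| 008 | 2026-05-18 | LOWTNYC 19/5 B71.5 | NO | 6.2% | 17.0% | 40.0% | 40.0% | PENDING | |
| 007 | 2026-05-16 | LOWTNYC 17/5 B64.5 | NO | 8.4% (B=0.0070) | 16.0% (B=0.0255) | 35.5% (B=0.1260) | 35.5% | WIN (NO) | +$55.02 |
| 006 | 2026-05-15 | LOWTNYC 16/5 B53.5 | NO | 16.8% (B=0.0284) | 21.9% (B=0.0478) | 39.5% (B=0.1560) | 39.5% | WIN (NO) | +$65.17 |
| 005 | 2026-05-14 | LOWTNYC 15/5 B49.5 | NO | 14.9% (B=0.7240) | 18.6% (B=0.6626) | 34.5% (B=0.4290) | 34.5% | LOSS (YES) | −$99.56 |
| 004 | 2026-05-13 | LOWTNYC 14/5 B52.5 | NO | 14.1% (B=0.7385) | 17.1% (B=0.6873) | 32.0% (B=0.4624) | 32.0% | LOSS (YES) | −$99.96 |
| 003 | 2026-05-12 | LOWTNYC 13/5 B51.5 | NO | 14.0% (B=0.0196) | 18.0% (B=0.0326) | 34.5% (B=0.1190) | 34.5% | WIN (NO) | +$52.44 |
| 002 | 2026-05-10 | LOWTNYC 11/5 B50.5 | NO | 14.6% (B=0.0213) | 15.4% (B=0.0237) | | 36.0% | WIN (NO) | +$56.16 |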
★ = best Brier this run · B = Brier score per model · P&L = champion only (challengers and baselines are shadow; no real exposure).
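
As a rough sketch of how the per-run B values and the promotion rule fit together: the per-trade Brier score is the squared error between P(YES) and the 0/1 outcome, which reproduces the figures in the table. The dominance check is only one plausible reading of "rolling-mean Brier dominance over N≥10 resolved trades"; the function name `challenger_dominates` and the strict "lower mean over the last N paired trades" criterion are assumptions, not the documented rule.

```python
from statistics import mean

MIN_RESOLVED = 10  # the N>=10 requirement stated above


def brier(p_yes: float, outcome_yes: int) -> float:
    """Single-trade Brier score: squared error between P(YES) and the 0/1 outcome."""
    return (p_yes - outcome_yes) ** 2


def challenger_dominates(champion_scores, challenger_scores, window=MIN_RESOLVED):
    """Assumed rule: the challenger's mean Brier over the last `window` resolved
    trades is strictly lower than the champion's. Returns False while fewer
    than `window` paired scores exist."""
    if len(champion_scores) != len(challenger_scores):
        raise ValueError("scores must be paired per run")
    if len(champion_scores) < window:
        return False
    return mean(challenger_scores[-window:]) < mean(champion_scores[-window:])


# Run 003: champion P(YES) = 14.0%, market resolved NO (outcome 0).
print(round(brier(0.140, 0), 4))  # 0.0196
# Run 004: baseline P(YES) = 32.0%, market resolved YES (outcome 1).
print(round(brier(0.320, 1), 4))  # 0.4624

# With only six resolved trades, no swap can be considered yet.
print(challenger_dominates([0.30] * 6, [0.25] * 6))  # False
```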

B. Named factors

Each row is a named hypothesis used by the learned predictor at training time. Brier Δ is the leave-one-out test delta from the most recent training run — sort by it to see what carried the model.

| Name | Hypothesis | Source | Added | Brier Δ | Status |
|---|---|---|---|---|---|
| p_ensemble | Mean of four vendor probabilities (ECMWF + GraphCast + GFS + JMA). Hypothesis: vendor disagreement washes out, the mean is the wisest single bet. (Bench 2026-05-11 N=138: ensemble Brier 0.1429 vs kalshi_mid 0.0845; the average **lost** to the market, so we need to learn weights instead of averaging blindly.) | derived from `predictors/ensemble.py` | 2026-05-09 | ↑ +0.0041 | active |
| p_climatology | Historical base rate of (variable in [lower, upper]) over the same date-of-year window from the past 15 years. The dumb-but-honest prior every forecast must beat. | derived from `predictors/climatology.py` (Open-Meteo historical) | 2026-05-09 | ↑ +0.0015 | experimental |
| forecast_spread | Max − min of the per-vendor probabilities (proxy of model disagreement). Hypothesis: when vendors disagree, the prediction is less trustworthy and the market mid carries more weight than the model. | derived from `predictions.ensemble.inputs.individual_probs` | 2026-05-09 | ↑ +0.0008 | active |
| urban_density_5km | OSM `way["building"]` count within 5 km of the station. Hypothesis: urban heat island raises overnight lows above what a non-urban climatology predicts → biases low-temp markets in cities. Units: building count (not %-area; see README for why). | OSM Overpass API | 2026-05-11 | ↑ +0.0000 | experimental |
| elevation_m | USGS EPQS elevation at the station point. Hypothesis: thinner air at altitude amplifies the diurnal swing (Denver KDEN ~1638 m vs. Miami KMIA ~2 m at the extremes of our station set). | https://epqs.nationalmap.gov/v1/json | 2026-05-11 | ↑ +0.0000 | experimental |
| latitude | Station latitude (degrees, signed). Hypothesis: insolation, daylight length, and seasonal amplitude scale with `cos(latitude)`; an explicit feature lets the learner discover the interaction with the date-of-year encoded in climatology. | NWS_STATIONS table | 2026-05-11 | ↑ +0.0000 | experimental |
| forest_pct_5km | OSM `natural=wood` + `landuse=forest` feature count within 5 km. Hypothesis: canopy cover lowers daytime highs (shade + evapotranspiration) and limits radiative night cooling (canopy traps). Units: feature count. | OSM Overpass API | 2026-05-11 | ↑ +0.0000 | experimental |
| water_pct_10km | OSM `natural=water` + `waterway=*` feature count within 10 km. Hypothesis: large water bodies dampen diurnal swings via thermal inertia → tightens the [lower, upper] hit probability for both highs and lows. Units: feature count (kept the `_pct_` name from the spec for continuity). | OSM Overpass API | 2026-05-11 | ↓ −0.0000 | experimental |
| distance_to_coast_km | Haversine distance to the nearest Natural Earth 1:50m coastline vertex. Hypothesis: maritime regime (Boston, Miami, SFO) damps extremes; continental regime (Denver, Oklahoma City) amplifies them. | Natural Earth `ne_50m_coastline.geojson` | 2026-05-11 | ↓ −0.0000 | experimental |
| days_ahead | Days between snapshot and target_date. Hypothesis: forecast skill decays with horizon, learned weights should interact non-linearly with this. | derived from `predictions.forecast_blend.inputs.days_ahead` | 2026-05-09 | ↓ −0.0000 | experimental |
| p_forecast_blend | Open-Meteo deterministic forecast around target_date, blended with climatology by horizon. Hypothesis: state-of-art deterministic forecast carries calibrated short-horizon signal. | derived from `predictors/forecast_blend.py` | 2026-05-09 | ↓ −0.0021 | active |
| p_nws_ndfd | P(YES) computed from the NWS NDFD official forecast, gaussian around NDFD temp with sigma from climatology range. Hypothesis: the agency that *resolves* Kalshi weather markets (NWS Climatological Report Daily) should issue the highest-signal forecast available. | https://api.weather.gov | 2026-05-11 | TBD (forward-only; no historical coverage yet) | experimental |
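
Three of the derived factors in the table are simple enough to sketch directly: the ensemble mean, the vendor spread, and a Gaussian bin probability of the kind described for p_nws_ndfd. The function names mirror the factor names but are illustrative; the vendor probabilities, the forecast temperature, sigma, and the bin edges in the example are made-up inputs, and how sigma is actually derived from the climatology range is not specified on this page.

```python
import math
from statistics import mean


def p_ensemble(vendor_probs):
    """Unweighted mean of the per-vendor probabilities."""
    return mean(vendor_probs)


def forecast_spread(vendor_probs):
    """Max minus min of the per-vendor probabilities (disagreement proxy)."""
    return max(vendor_probs) - min(vendor_probs)


def _normal_cdf(x, mu, sigma):
    """CDF of Normal(mu, sigma) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def gaussian_bin_probability(forecast_temp, sigma, lower, upper):
    """P(lower <= T <= upper) for T ~ Normal(forecast_temp, sigma)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return _normal_cdf(upper, forecast_temp, sigma) - _normal_cdf(lower, forecast_temp, sigma)


# Hypothetical inputs, for illustration only.
vendor_probs = [0.10, 0.20, 0.15, 0.35]
print(f"p_ensemble      = {p_ensemble(vendor_probs):.3f}")
print(f"forecast_spread = {forecast_spread(vendor_probs):.3f}")
# A one-degree bin centred on the forecast with sigma = 2 gives roughly 0.197.
print(f"p_bin           = {gaussian_bin_probability(57.5, 2.0, 57.0, 58.0):.3f}")
```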

Click a row for the full hypothesis, source link, and per-run history. Brier Δ is the leave-one-out test-Brier delta from the latest run — negative (↓) = feature carried signal, positive (↑) = net noise on this split.
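
The sign convention can be made concrete with a small ablation loop. This sketch assumes the delta is defined as Brier(all features) minus Brier(all features except one), which matches "negative = feature carried signal" but is not confirmed by the page. `test_brier` stands for any callable that refits the model on a feature subset and returns its test Brier; here it is replaced by a toy function with invented contributions so the example runs on its own.

```python
from typing import Callable, Dict, Sequence


def leave_one_out_deltas(
    features: Sequence[str],
    test_brier: Callable[[Sequence[str]], float],
) -> Dict[str, float]:
    """Brier delta per feature: Brier(all features) - Brier(all but one).

    Negative delta: removing the feature made the test Brier worse, so the
    feature carried signal. Positive delta: the model scored better without it.
    """
    full = test_brier(list(features))
    deltas = {}
    for name in features:
        reduced = [f for f in features if f != name]
        deltas[name] = full - test_brier(reduced)
    return deltas


# Toy stand-in for "refit the model and score it on the test rows".
# These contributions are invented for illustration only.
_TOY_CONTRIBUTION = {"helps": -0.010, "neutral": 0.0, "hurts": 0.004}


def toy_test_brier(active):
    return 0.14 + sum(_TOY_CONTRIBUTION[f] for f in active)


deltas = leave_one_out_deltas(list(_TOY_CONTRIBUTION), toy_test_brier)
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{name:8s} {delta:+.4f}")
```

Sorting ascending puts the features that carried the most signal first, which is the ordering the table suggests reading.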

C. Latest training run

Snapshot of the most recent sklearn fit of the learned predictor on historical resolutions. This is not a paper trade — it's a cross-validation pass to see whether the current feature set has edge over kalshi_mid on past Kalshi events.

Feature set v2 · 2026-05-14 19:19 UTC · Verdict: MARKET WINS

n_train: 312
n_test: 84
Brier train: 0.1282
Brier test: 0.1323
Brier kalshi_mid: 0.1305
gap (test − kalshi_mid): +0.0018
log-loss test: 0.4222
log-loss kalshi_mid: 0.4071

Notes: Phase A.2 rerun under clean target_date split; supersedes invalidated 20260514T141925Z. Decision-gate run for the temporal-split fix.
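
The gap and verdict reduce to scoring two probability vectors against the same outcomes. The sketch below is a plain-Python illustration under assumptions: the function names are hypothetical, the "MODEL WINS" label and the treatment of a tie as a market win are guesses, and the probabilities are toy values rather than rows from the training set.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared error between predicted P(YES) and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def verdict(model_probs, market_probs, outcomes):
    """Score model and market on the same rows and report the gap."""
    model = brier_score(model_probs, outcomes)
    market = brier_score(market_probs, outcomes)
    gap = model - market  # positive gap: the market scored better
    # "MODEL WINS" is a hypothetical label; only "MARKET WINS" appears on the page.
    label = "MARKET WINS" if gap >= 0 else "MODEL WINS"
    return {"brier_test": model, "brier_kalshi_mid": market, "gap": gap, "verdict": label}


# Toy rows, not real data.
outcomes = [0, 1, 0]
model_probs = [0.2, 0.7, 0.1]
market_probs = [0.3, 0.6, 0.2]
result = verdict(model_probs, market_probs, outcomes)
print(result["verdict"], f"{result['gap']:+.4f}")
```

For the run shown above the same subtraction gives 0.1323 − 0.1305 = +0.0018, hence MARKET WINS.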

D. Training run history

Every sklearn training pass, most recent first. This is not the paper-trade history — see section A for that. A training run with Brier test below Brier kalshi_mid on the same rows means the model has signal beyond the market mid in cross-validation.

| When (UTC) | Feature set | n_test | Brier test | Brier kalshi_mid | Gap | Verdict | Notes |
|---|---|---|---|---|---|---|---|
| 2026-05-14 19:19 UTC | v2 | 84 | 0.1323 | 0.1305 | +0.0018 | MARKET WINS | Phase A.2 rerun under clean target_date split; supersedes invalidated 20260514T141925Z. Decision-gate run for the temporal-split fix. |
| 2026-05-12 13:45 UTC | v2 | 65 | 0.1301 | 0.0764 | +0.0537 | MARKET WINS | schema v2: add intercept + feature_means/stds for live inference |
| 2026-05-11 13:08 UTC | v2 | 42 | 0.1260 | 0.0282 | +0.0978 | MARKET WINS | First A.3 discovery run. V2 = V0 baseline + 6 static geographic features (urban_density_5km, water_pct_10km, forest_pct_5km, elevation_m, distance_to_coast_km, latitude). N=138 resolved, train=96 (older), test=42 (newer). Test slice currently collapses to a single capture date (20260510T171217Z), which limits temporal variance; geographic deltas mostly null on this split as expected. |
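
The notes mention a "clean target_date split" with older rows in train and newer rows in test. One way such a split could look is sketched below; the row layout (a dict with a `target_date` key), the function name, and the `test_fraction` parameter are all assumptions, and the actual pipeline may cut differently.

```python
from datetime import date


def split_by_target_date(rows, test_fraction=0.3):
    """Older target dates go to train, newer to test.

    The cut is made on distinct dates, so no target_date ever appears on
    both sides of the split.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    dates = sorted({row["target_date"] for row in rows})
    if len(dates) < 2:
        raise ValueError("need at least two distinct target dates")
    n_test_dates = min(max(1, round(len(dates) * test_fraction)), len(dates) - 1)
    cutoff = dates[-n_test_dates]  # earliest date that belongs to the test side
    train = [row for row in rows if row["target_date"] < cutoff]
    test = [row for row in rows if row["target_date"] >= cutoff]
    return train, test


# Toy rows: five target dates, two rows each.
rows = [
    {"target_date": date(2026, 5, day), "row_id": f"{day}-{k}"}
    for day in range(10, 15)
    for k in range(2)
]
train, test = split_by_target_date(rows, test_fraction=0.4)
print(len(train), len(test))  # 6 4
```

Splitting on distinct dates rather than on row counts is what keeps same-day rows from leaking across the boundary.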

E. Training Brier trajectory

Learned model (test) vs kalshi_mid (same test rows) across all training runs. Dashed horizontal line is the most recent kalshi_mid Brier as all-time reference; vertical dashed markers flag a feature-set bump (v0 → v1 → v2 …).

[Chart: Brier score (lower = better) by run (UTC date: 05-11, 05-12, 05-14). Series: learned (test) and kalshi_mid (test). Reference line: bench (0.1305). Feature-set marker: v2.]

F. Backtest replays

Replayed records produced by backtest.py against settled Kalshi events. Only strict point-in-time records count toward N_backtest_strict; NAIVE-mode rows are flagged and excluded from the hybrid sample. Filters above narrow both this table and the live runs section.

No backtest replays in the manifest. The aggregate count may still be non-zero — per-record detail is omitted by the manifest builder when the ledger exceeds the inline budget.
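
When per-record detail is available, the strict and NAIVE counts could be tallied along these lines. The record shape is hypothetical: the `mode` field and its `"strict"` / `"naive"` values are assumed names, and counting a record with no mode as NAIVE is a conservative choice made for this sketch, not documented behaviour of backtest.py.

```python
from collections import Counter


def count_backtest_modes(records):
    """Tally replay records by mode; only strict records feed N_backtest_strict."""
    # Assumption: a record without a mode is treated as naive, so it can never
    # inflate the hybrid sample by accident.
    modes = Counter(record.get("mode", "naive") for record in records)
    return {
        "n_backtest_strict": modes["strict"],
        "n_naive_excluded": modes["naive"],
    }


# Toy records, not real ledger rows.
records = [
    {"id": 1, "mode": "strict"},
    {"id": 2, "mode": "naive"},
    {"id": 3, "mode": "strict"},
    {"id": 4},  # no mode recorded
]
tally = count_backtest_modes(records)
print(tally)  # {'n_backtest_strict': 2, 'n_naive_excluded': 2}
```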