Measurement & Determinism

Measurement is what the rest of the loop stands on. Statuses, profiles, bundles, and alerts are only as honest as the numbers underneath them. This page covers the two quantities the measurement core works with, the rules that keep those numbers trustworthy, and why the same input always replays to the same output.

The two quantities: θ and δ

Symbol	Name	What it is	Who sees it
θ (theta)	ability estimate	A statistical estimate of a student’s reading ability on the latent scale, re-estimated after each answered item.	Internal only. Never shown to a student or family.
δ (delta) / `delta_prior`	item difficulty	How hard an item is. `delta_prior` is the deterministic, formula-derived estimate computed at import, before real calibration.	Developer-facing.

The diagnostic chains θ from item to item: the server grades each answer, updates the estimate, and then picks the next item near the student’s current level. The student never sees θ or a score. They see tasks and supportive feedback.
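The "pick the next item near the student's current level" step can be sketched as a nearest-difficulty lookup. This is a minimal illustration, not the server's actual selection rule: the function name `pick_next_item`, the dictionary of difficulties, and the tie-break on item id are all assumptions made for the example.

```python
from typing import Mapping, Optional, Set


def pick_next_item(
    theta: float,
    difficulties: Mapping[str, float],
    answered: Set[str],
) -> Optional[str]:
    """Return the unanswered item whose difficulty is closest to theta.

    Hypothetical helper: ties are broken by item id so that the same
    inputs always select the same item.
    """
    candidates = [
        (abs(delta - theta), item_id)
        for item_id, delta in difficulties.items()
        if item_id not in answered
    ]
    if not candidates:
        return None  # nothing left to serve
    # min() on (distance, item_id) tuples compares distance first, then id.
    return min(candidates)[1]


items = {"a": -1.0, "b": 0.2, "c": 1.5}
first = pick_next_item(0.0, items, answered=set())   # closest to 0.0
second = pick_next_item(0.0, items, answered={"b"})  # next closest
```

Breaking ties on a stable key rather than on iteration order keeps the selection replayable, which matters for the determinism rule described later on this page.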

Rasch and the Wave 1 placeholder

Amal calibrates item difficulty with a Rasch measurement model. Real calibration refits δ from observed responses (it needs a large response volume per item) and freezes the result into a versioned snapshot.
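For orientation, the standard dichotomous Rasch model gives the probability of a correct answer as a logistic function of θ − δ. The sketch below shows that textbook formula only. This page does not say which Rasch variant Amal fits, and the function name `rasch_p_correct` is an assumption for the example.

```python
import math


def rasch_p_correct(theta: float, delta: float) -> float:
    """P(correct) = exp(theta - delta) / (1 + exp(theta - delta)).

    Textbook dichotomous Rasch model, written in a numerically stable
    form so large |theta - delta| does not overflow math.exp.
    """
    x = theta - delta
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


# When ability equals difficulty, the model predicts a 50% chance.
p_even = rasch_p_correct(0.0, 0.0)
# An easier item (lower delta) raises the predicted chance.
p_easy = rasch_p_correct(0.0, -1.0)
```

Under this model, an item whose δ is close to the student's θ is the most informative one to ask next, which is why the diagnostic picks items near the current level.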

Wave 1 ships a deterministic mock placeholder for scoring, not the calibrated model. Real Rasch calibration is a later work package. The mock is a fixed, replayable formula. It behaves like the real engine for integration purposes, but the δ values it relies on are formula-derived priors, not field-calibrated. Once real calibration lands, every live session will read exactly one pinned `calibration_version`, so historical decisions stay reproducible.

Determinism is a feature (V-6)

The measurement core is deterministic by design (V-6): the same input plus the same rule and calibration version produces byte-identical output, every time. This is what makes the whole loop auditable, because any past decision can be replayed against the exact snapshot that produced it. θ is persisted at full precision and the resolved item difficulties are frozen per session, so a replay months later yields the identical estimate even after a future recalibration.
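One way to check a byte-identical replay is to serialize the output canonically and compare digests. The sketch below is an illustration under assumptions, not the engine's actual audit mechanism: `canonical_bytes`, `output_digest`, `replay_matches`, the toy engine, and the field names in `frozen` are all hypothetical.

```python
import hashlib
import json
from typing import Callable


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys and fixed separators.

    Equal records then produce equal bytes regardless of key order.
    """
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def output_digest(record: dict) -> str:
    """SHA-256 of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


def replay_matches(
    stored_digest: str,
    engine: Callable[[dict], dict],
    frozen_input: dict,
) -> bool:
    """Re-run the engine on the frozen input and compare digests."""
    return output_digest(engine(frozen_input)) == stored_digest


def toy_engine(frozen_input: dict) -> dict:
    """Stand-in for the measurement core: a pure function of its input."""
    deltas = frozen_input["deltas"]
    return {
        "calibration_version": frozen_input["calibration_version"],
        "theta": sum(deltas) / len(deltas),
    }


frozen = {"calibration_version": "mock-1", "deltas": [0.5, -0.25, 1.0]}
stored = output_digest(toy_engine(frozen))
replay_ok = replay_matches(stored, toy_engine, frozen)
```

The check only works because the engine is a pure function of the frozen input. Anything read from a clock, a random source, or a mutable table would break the comparison.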

No LLM in the measurement loop (V-5)

No language model ever participates in measurement (V-5). No LLM computes θ, picks an item, scores a probe, or runs alongside a student session. Statistics measure, humans decide, and LLMs only draft teacher-side artifacts that a teacher reviews. There is a single enforced chokepoint for LLM calls, and it is kept entirely outside the scoring path.

No single global percentage (V-3)

There is no overall reading score anywhere (V-3). Status is always reported per measure or per macro-domain, never as one rolled-up number. Each measure resolves to one of five states:

MeasureStatus	Meaning
`Meets`	At or above the benchmark for this measure.
`Approaching`	Near the benchmark; watch.
`Below`	Under the benchmark; needs support.
`Severe`	Well under; priority need.
`Not_Assessed`	No comparable evidence yet.

`Not_Assessed` is never treated as zero and is excluded from every rollup. A missing measure is missing, and it never silently drags a status down. This is the honest-coverage rule that keeps the loop from punishing students for gaps in data.
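The exclusion rule can be sketched as a filter applied before any combining step. In the example below, only the filtering reflects this page. The combining rule ("the most severe assessed status wins") is a placeholder assumption, since the real rule belongs to the Decision Engine, and the names `rollup` and `_SEVERITY` are hypothetical.

```python
from enum import Enum
from typing import Iterable


class MeasureStatus(Enum):
    MEETS = "Meets"
    APPROACHING = "Approaching"
    BELOW = "Below"
    SEVERE = "Severe"
    NOT_ASSESSED = "Not_Assessed"


# Assumed ordering, used only by the placeholder combining rule below.
_SEVERITY = {
    MeasureStatus.MEETS: 0,
    MeasureStatus.APPROACHING: 1,
    MeasureStatus.BELOW: 2,
    MeasureStatus.SEVERE: 3,
}


def rollup(statuses: Iterable[MeasureStatus]) -> MeasureStatus:
    """Combine measure statuses for one domain, ignoring missing evidence."""
    # Drop Not_Assessed first: a missing measure never counts as a low one.
    assessed = [s for s in statuses if s is not MeasureStatus.NOT_ASSESSED]
    if not assessed:
        # Nothing comparable was measured, so the rollup is missing too.
        return MeasureStatus.NOT_ASSESSED
    # Placeholder rule (assumption): the most severe assessed status wins.
    return max(assessed, key=_SEVERITY.__getitem__)


domain = rollup([MeasureStatus.MEETS, MeasureStatus.NOT_ASSESSED])
```

Here `domain` is `MeasureStatus.MEETS`: the unassessed measure is left out rather than counted against the student.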

Append-only audit

Measurement-relevant tables are append-only. Evidence rows, scored responses, and θ estimates are never edited or deleted in place; corrections are new rows. This guarantees the record the engine decided from is exactly the record you can later inspect, which is what makes V-6 replay meaningful.
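The "corrections are new rows" pattern can be sketched in memory as follows. This is an illustration only: the real tables, their columns, and how a correction points at the row it replaces are not described on this page, so `ThetaRow`, `AppendOnlyLog`, and the `supersedes` field are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)  # frozen: a stored row cannot be mutated in place
class ThetaRow:
    row_id: int
    session_id: str
    theta: float
    supersedes: Optional[int] = None  # row_id this row corrects, if any


class AppendOnlyLog:
    """Hypothetical in-memory log exposing append and read, nothing else."""

    def __init__(self) -> None:
        self._rows: List[ThetaRow] = []

    def append(
        self, session_id: str, theta: float, supersedes: Optional[int] = None
    ) -> ThetaRow:
        if supersedes is not None and not 1 <= supersedes <= len(self._rows):
            raise ValueError("supersedes must reference an existing row")
        row = ThetaRow(len(self._rows) + 1, session_id, theta, supersedes)
        self._rows.append(row)
        return row

    def rows(self) -> Tuple[ThetaRow, ...]:
        """Full history, including rows that were later corrected."""
        return tuple(self._rows)

    def current(self, session_id: str) -> Optional[ThetaRow]:
        """Latest row for the session that no later row has superseded."""
        superseded = {r.supersedes for r in self._rows if r.supersedes is not None}
        live = [
            r
            for r in self._rows
            if r.session_id == session_id and r.row_id not in superseded
        ]
        return live[-1] if live else None


log = AppendOnlyLog()
original = log.append("s1", 0.25)
fixed = log.append("s1", 0.40, supersedes=original.row_id)
```

After the correction, `log.current("s1")` returns the new row while `log.rows()` still contains the original, so the record the engine decided from remains inspectable.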

Where to go next

The Decision Engine: how measured evidence becomes a skill, domain, and macro status.
Diagnostic & Practice: the session APIs that chain θ.
Standards & Benchmarks: where the per-measure cut points come from.
Glossary: θ, δ, Rasch, calibration, and MeasureStatus defined.