Measurement & Determinism
Measurement is what the rest of the loop stands on. Statuses, profiles, bundles, and alerts are only as honest as the numbers underneath them. This page covers the two quantities the measurement core works with, the rules that keep those numbers trustworthy, and why the same input always replays to the same output.
The two quantities: θ and δ
| Symbol | Name | What it is | Who sees it |
|---|---|---|---|
| θ (theta) | ability estimate | A student’s statistical reading ability on the latent scale, re-estimated after each answered item. | Internal only. Never shown to a student or family, ever. |
| δ (delta) / delta_prior | item difficulty | How hard an item is. delta_prior is the deterministic, formula-derived estimate computed at import, before real calibration. | Developer-facing. |
The diagnostic chains θ from one item to the next: the server grades each answer, updates the estimate, and then picks the next item whose difficulty is near the student’s current level. The student never sees θ or a score, only tasks and supportive feedback.
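The "pick the next item near the current level" step can be sketched as follows. This is a minimal illustration, not Amal's selection rule: the function name `pick_next_item`, the nearest-difficulty criterion, and the tie-break on item id are all assumptions made for the example.

```python
from typing import Mapping, Set


def pick_next_item(theta: float, difficulties: Mapping[str, float], seen: Set[str]) -> str:
    """Return the unseen item whose difficulty is closest to theta.

    Hypothetical rule: nearest difficulty wins, and ties break on the
    item id so the same inputs always select the same item.
    """
    candidates = [
        (abs(delta - theta), item_id)
        for item_id, delta in difficulties.items()
        if item_id not in seen
    ]
    if not candidates:
        raise ValueError("no unseen items left")
    # min() compares the distance first, then the item id.
    return min(candidates)[1]
```

The explicit tie-break matters for replay: without it, two equally distant items could be chosen in an order that depends on container iteration.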
Rasch and the Wave 1 placeholder
Amal calibrates item difficulty with a Rasch measurement model. Real calibration refits δ from observed responses (it needs a large response volume per item) and freezes the result into a versioned snapshot.
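For orientation, the standard dichotomous Rasch model relates θ and δ through a logistic function of their difference. The sketch below shows that textbook form only; whether Amal uses exactly this variant is an assumption, and the function name `rasch_p_correct` is hypothetical.

```python
import math


def rasch_p_correct(theta: float, delta: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model.

    P(correct) = 1 / (1 + exp(-(theta - delta)))
    When ability equals difficulty the probability is exactly 0.5.
    """
    return 1.0 / (1.0 + math.exp(-(theta - delta)))
```

An item is easier for a student the further δ sits below θ, which is why selecting items near the current estimate keeps each response informative.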
Wave 1 ships a deterministic mock placeholder for scoring, not the calibrated
model. Real Rasch calibration is a later work package. The mock is a fixed,
replayable formula. It behaves like the real engine for integration purposes, but
the δ values it relies on are formula-derived priors, not field-calibrated. When
real calibration lands, every live session reads one pinned calibration_version
so historical decisions stay reproducible.
Determinism is a feature (V-6)
The measurement core is deterministic by design (V-6): the same input plus the same rule and calibration version produces byte-identical output, every time. This is what makes the whole loop auditable, because any past decision can be replayed against the exact snapshot that produced it. θ is persisted at full precision and the resolved item difficulties are frozen per session, so a replay months later yields the identical estimate even after a future recalibration.
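One way to check byte-identical replay is to serialize every output canonically and compare digests. The sketch below assumes a JSON encoding and SHA-256, neither of which is confirmed by this page; `canonical_bytes`, `run`, `replay_matches`, `toy_decide`, and the snapshot fields are hypothetical names used only for illustration.

```python
import hashlib
import json
from typing import Any, Callable, Dict

Snapshot = Dict[str, Any]
Decide = Callable[[Snapshot], Dict[str, Any]]


def canonical_bytes(payload: Dict[str, Any]) -> bytes:
    """Encode with sorted keys and fixed separators so equal data gives equal bytes."""
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return text.encode("utf-8")


def run(decide: Decide, snapshot: Snapshot) -> bytes:
    """Apply a pure decision function to a frozen snapshot and encode the result."""
    return canonical_bytes(decide(snapshot))


def replay_matches(decide: Decide, snapshot: Snapshot, recorded_digest: str) -> bool:
    """Re-run the decision and compare its digest with the one recorded earlier."""
    return hashlib.sha256(run(decide, snapshot)).hexdigest() == recorded_digest


def toy_decide(snapshot: Snapshot) -> Dict[str, Any]:
    # Hypothetical stand-in for the real scoring rule: counts correct responses.
    n_correct = sum(1 for r in snapshot["responses"] if r["correct"])
    return {
        "calibration_version": snapshot["calibration_version"],
        "rule_version": snapshot["rule_version"],
        "n_correct": n_correct,
    }
```

The snapshot is assumed to carry everything the decision depends on (responses, frozen difficulties, rule and calibration versions), so the function never reads a clock, a random source, or live tables. Python's `json` module writes floats in a form that parses back to the same value, which fits the requirement that θ is persisted at full precision.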
No LLM in the measurement loop (V-5)
No language model ever participates in measurement (V-5). No LLM computes θ, picks an item, scores a probe, or runs alongside a student session. Statistics measure, humans decide, and LLMs only draft teacher-side artifacts that a teacher reviews. There is a single enforced chokepoint for LLM calls, and it sits nowhere near the scoring path.
No single global percentage (V-3)
There is no overall reading score anywhere (V-3). Status is always reported per-measure or per-macro-domain, never as one rolled-up number. Each measure resolves to one of five states:
| MeasureStatus | Meaning |
|---|---|
| Meets | At or above the benchmark for this measure. |
| Approaching | Near the benchmark; watch. |
| Below | Under the benchmark; needs support. |
| Severe | Well under; priority need. |
| Not_Assessed | No comparable evidence yet. |
Not_Assessed is never treated as zero and is excluded from every rollup. A
missing measure is missing, and it never silently drags a status down. This is the
honest-coverage rule that keeps the loop from punishing students for gaps in data.
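The exclusion rule can be sketched as a rollup that filters out Not_Assessed before combining anything. The combination rule used here (take the most severe assessed status) is an assumption for illustration only; the actual rollup logic belongs to the Decision Engine and is not specified on this page. The names `rollup` and `_SEVERITY` are hypothetical.

```python
from enum import Enum
from typing import Iterable


class MeasureStatus(Enum):
    MEETS = "Meets"
    APPROACHING = "Approaching"
    BELOW = "Below"
    SEVERE = "Severe"
    NOT_ASSESSED = "Not_Assessed"


# Hypothetical ordering used only by this sketch. Not_Assessed has no rank
# on purpose: it is not a low score, it is the absence of one.
_SEVERITY = {
    MeasureStatus.MEETS: 0,
    MeasureStatus.APPROACHING: 1,
    MeasureStatus.BELOW: 2,
    MeasureStatus.SEVERE: 3,
}


def rollup(statuses: Iterable[MeasureStatus]) -> MeasureStatus:
    """Combine measure statuses, ignoring Not_Assessed entirely."""
    assessed = [s for s in statuses if s is not MeasureStatus.NOT_ASSESSED]
    if not assessed:
        # Nothing comparable was measured, so the rollup is also missing.
        return MeasureStatus.NOT_ASSESSED
    return max(assessed, key=_SEVERITY.__getitem__)
```

With this shape, a domain with one Meets measure and one missing measure rolls up to Meets, and a domain with no assessed measures stays Not_Assessed rather than collapsing to the worst state.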
Append-only audit
Measurement-relevant tables are append-only. Evidence rows, scored responses, and θ estimates are never edited or deleted in place; corrections are new rows. This guarantees the record the engine decided from is exactly the record you can later inspect, which is what makes V-6 replay meaningful.
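As an illustration of the append-only pattern, the sketch below uses an in-memory SQLite database with triggers that reject UPDATE and DELETE, and records a correction as a new row pointing at the row it supersedes. The database engine, the table name `theta_estimates`, the `supersedes_id` column, and the helper functions are all assumptions; this page does not say how Amal enforces the rule.

```python
import sqlite3
from typing import Optional


def open_audit_db() -> sqlite3.Connection:
    """Create a hypothetical append-only table guarded by triggers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE theta_estimates (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            session_id TEXT NOT NULL,
            theta REAL NOT NULL,
            supersedes_id INTEGER REFERENCES theta_estimates(id)
        );
        CREATE TRIGGER theta_no_update BEFORE UPDATE ON theta_estimates
        BEGIN
            SELECT RAISE(ABORT, 'theta_estimates is append-only');
        END;
        CREATE TRIGGER theta_no_delete BEFORE DELETE ON theta_estimates
        BEGIN
            SELECT RAISE(ABORT, 'theta_estimates is append-only');
        END;
        """
    )
    return conn


def append_estimate(
    conn: sqlite3.Connection,
    session_id: str,
    theta: float,
    supersedes_id: Optional[int] = None,
) -> int:
    """Insert a new estimate; a correction references the row it replaces."""
    cur = conn.execute(
        "INSERT INTO theta_estimates (session_id, theta, supersedes_id) VALUES (?, ?, ?)",
        (session_id, theta, supersedes_id),
    )
    conn.commit()
    return cur.lastrowid
```

Because the original row survives every correction, a replay can read exactly the rows the engine saw at decision time and ignore anything appended afterwards.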
Where to go next
- The Decision Engine: how measured evidence becomes a skill, domain, and macro status.
- Diagnostic & Practice: the session APIs that chain θ.
- Standards & Benchmarks: where the per-measure cut points come from.
- Glossary: θ, δ, Rasch, calibration, and MeasureStatus defined.