Statistical Tests
This page explains the statistical tests used to validate die value randomness quality. These tests run continuously on rolling windows of beacon output.
Note: For tests comparing rng.dev against drand and NIST beacons (hash byte comparison), see Benchmark Tests.
Understanding the Results
P-Values
Every test produces a p-value between 0 and 1. This represents the probability of seeing results this extreme (or more extreme) if the data were truly random.
| P-Value | Status | Meaning |
|---|---|---|
| > 0.05 | PASS | No evidence against randomness |
| 0.01 - 0.05 | WATCH | Borderline — monitor for patterns |
| < 0.01 | INVESTIGATE | Statistically significant deviation |
Expected Failures
Truly random data will sometimes fail randomness tests. This is not a bug — it's mathematics.
At significance level α = 0.05:
- 5% of tests will fail even for perfect randomness
- Running 6 tests means ~26% chance at least one fails per window
- This is why we apply Bonferroni correction (divide α by number of tests)
The dashboard shows occasional failures as expected behavior, not system problems. Only persistent, repeated failures indicate potential bias.
Test Descriptions
1. Chi-Squared Distribution Test
What it tests: Are all die faces appearing with equal frequency?
How it works:
- Count occurrences of each face (1-6)
- Compare observed counts to expected counts (n/6 each)
- Calculate chi-squared statistic: χ² = Σ (observed - expected)² / expected
What it detects:
- Biased die (one face appears more often)
- Manufacturing defects in physical dice
- Software bugs favoring certain values
Interpretation:
- High p-value (>0.05): Distribution looks uniform
- Low p-value (<0.01): Some faces appear too often or too rarely
Example:
1000 rolls, expected 166.7 per face
Face | Observed | Expected | Contribution
-----|----------|----------|-------------
1 | 158 | 166.7 | 0.45
2 | 172 | 166.7 | 0.17
3 | 165 | 166.7 | 0.02
4 | 170 | 166.7 | 0.07
5 | 168 | 166.7 | 0.01
6 | 167 | 166.7 | 0.00
─────────
χ² = 0.72
p-value = 0.98 → PASS (no bias detected)
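The computation in the example above can be sketched in Python. This is a minimal stdlib-only illustration, not the beacon's actual implementation; the constant 11.07 is the χ² critical value for df = 5 at α = 0.05.

```python
from collections import Counter

def chi_squared_statistic(rolls):
    """Chi-squared goodness-of-fit statistic against a uniform d6."""
    counts = Counter(rolls)
    expected = len(rolls) / 6
    return sum((counts.get(face, 0) - expected) ** 2 / expected
               for face in range(1, 7))

# Counts from the worked example above: 158, 172, 165, 170, 168, 167
rolls = [1] * 158 + [2] * 172 + [3] * 165 + [4] * 170 + [5] * 168 + [6] * 167
stat = chi_squared_statistic(rolls)

# Critical value for df = 5 at alpha = 0.05 is ~11.07 — far above this statistic
assert stat < 11.07
```

A perfectly balanced sample (each face appearing equally often) yields a statistic of exactly zero; the p-value is then obtained from the χ² distribution with 5 degrees of freedom.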
2. Runs Test (Odd/Even Parity)
What it tests: Do odd and even values alternate appropriately?
How it works:
- Convert each die value to parity: odd (1,3,5) → 1, even (2,4,6) → 0
- Count "runs" — consecutive sequences of same parity
- Compare run count to expected value for random sequence
What it detects:
- Too much alternation (odd-even-odd-even pattern)
- Too much clustering (odd-odd-odd-odd pattern)
- Predictable sequencing
Why odd/even instead of above/below median:
- Discrete die values split cleanly: exactly 3 odd, 3 even
- Avoids ambiguity at median (3.5)
- Better statistical properties for 6-sided dice
Interpretation:
- High p-value: Runs count is normal for random data
- Low p-value: Sequence is too clustered or too alternating
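The parity runs test can be sketched with the standard Wald–Wolfowitz normal approximation. This is illustrative stdlib-only code, not the production implementation; the function name is mine.

```python
import math

def runs_test_p(values):
    """Two-sided p-value for the odd/even runs test (normal approximation)."""
    bits = [v % 2 for v in values]          # odd (1,3,5) -> 1, even (2,4,6) -> 0
    n1 = sum(bits)
    n0 = len(bits) - n1
    n = n0 + n1
    # A "run" starts at the first element and after every parity change
    runs = 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)
    mean = 2 * n1 * n0 / n + 1
    var = 2 * n1 * n0 * (2 * n1 * n0 - n) / (n ** 2 * (n - 1))
    z = (runs - mean) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability

# A perfectly alternating sequence has far too many runs -> tiny p-value
p_alternating = runs_test_p([1, 2] * 50)
```

Both failure modes from the list above show up as small p-values: strict alternation produces too many runs, heavy clustering (e.g. all odds then all evens) produces too few.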
3. Streak Distribution
What it tests: Do consecutive same-value sequences follow expected lengths?
How it works:
- Find all "streaks" — consecutive identical values
- Count streaks of each length (1, 2, 3, 4+)
- Compare to theoretical distribution
Expected distribution for fair die:
P(streak length k) = (1/6)^(k-1) × (5/6)
Length | Probability | Per 1000 streaks
-------|-------------|------------------
1 | 83.3% | 833
2 | 13.9% | 139
3 | 2.3% | 23
4 | 0.4% | 4
5+ | 0.1% | 1
What it detects:
- Sticky behavior (too many long streaks)
- Anti-sticky behavior (too few repeats)
- Memory in the random source
Interpretation:
- High p-value: Streak lengths are normal
- Low p-value: Streaks are suspiciously long or short
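The streak-counting step and the geometric formula above can be sketched as follows (illustrative, stdlib only; function names are mine):

```python
from collections import Counter
from itertools import groupby

def streak_lengths(rolls):
    """Lengths of maximal runs of identical consecutive values."""
    return [len(list(group)) for _, group in groupby(rolls)]

def expected_streak_prob(k):
    """P(streak length == k) for a fair d6: geometric, continue w.p. 1/6."""
    return (1 / 6) ** (k - 1) * (5 / 6)

lengths = streak_lengths([3, 3, 5, 1, 1, 1, 6])   # -> [2, 1, 3, 1]
observed = Counter(lengths)
```

The observed counts per length can then be compared to `expected_streak_prob` with a chi-squared test, binning long streaks together (4+) so expected counts stay reasonable.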
4. Transition Matrix Test
What it tests: Does the next value depend on the current value?
How it works:
- Build 6×6 matrix counting transitions (e.g., "1 followed by 4")
- For each row, run chi-squared test against uniform distribution
- Report 6 separate p-values (one per starting value)
Expected behavior:
P(next = j | current = i) = 1/6 for all i, j
Transition matrix should look roughly like:
→ 1 2 3 4 5 6
1 16.7% 16.7% 16.7% 16.7% 16.7% 16.7%
2 16.7% 16.7% 16.7% 16.7% 16.7% 16.7%
...
What it detects:
- Markov dependencies ("3 is often followed by 5")
- Mechanical bias in physical dice
- Pseudo-random generator weaknesses
Why 6 separate tests instead of one:
- Raw matrix chi-squared violates independence assumptions
- Row-by-row testing is statistically valid
- Pinpoints which transitions are problematic
Interpretation:
- All rows p > 0.05: No transition bias detected
- One row p < 0.01: That starting value has biased follow-ups
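The row-by-row approach can be sketched as below. This illustration returns the raw χ² statistic per starting value (df = 5 each) rather than p-values, and is not the production code.

```python
from collections import Counter

def transition_row_statistics(rolls):
    """Raw chi-squared statistic for each row of the 6x6 transition matrix."""
    transitions = Counter(zip(rolls, rolls[1:]))
    stats = {}
    for current in range(1, 7):
        row = [transitions.get((current, nxt), 0) for nxt in range(1, 7)]
        total = sum(row)
        if total == 0:
            continue                      # starting value never observed
        expected = total / 6
        stats[current] = sum((obs - expected) ** 2 / expected for obs in row)
    return stats

# Extreme Markov dependency: "3 is always followed by 5" -> huge row-3 statistic
stats = transition_row_statistics([3, 5] * 100)
```

For genuinely independent rolls every row statistic should look like a χ² draw with 5 degrees of freedom, so one inflated row immediately pinpoints the biased starting value.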
5. Serial Pair Test
What it tests: Do all consecutive pairs appear with equal frequency?
How it works:
- Extract all consecutive pairs: (1,4), (4,2), (2,6), ...
- Count occurrences of each of 36 possible pairs
- Chi-squared test against expected frequency (n/36 each)
Expected behavior:
36 unique pairs, each with probability 1/36 ≈ 2.78%
In 10,000 rolls → 9,999 overlapping pairs → ~278 expected per pair
What it detects:
- Subtle sequential bias invisible to single-value tests
- "1 is more likely to follow 3" patterns
- State-dependent behavior
Difference from transition matrix:
- Transition matrix: conditional probabilities (P(next|current))
- Serial pair test: joint probabilities (P(current AND next))
- Both are valuable; they catch different anomalies
Interpretation:
- High p-value: All pairs appear equally often
- Low p-value: Some pairs are over/under-represented
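The pair-counting step can be sketched as follows (illustrative, stdlib only; returns the raw χ² statistic over the 36 pair counts, from which a p-value would be derived):

```python
from collections import Counter

def serial_pair_statistic(rolls):
    """Chi-squared statistic over all 36 overlapping consecutive pairs."""
    pairs = list(zip(rolls, rolls[1:]))
    counts = Counter(pairs)
    expected = len(pairs) / 36
    return sum((counts.get((i, j), 0) - expected) ** 2 / expected
               for i in range(1, 7) for j in range(1, 7))

# Cycling 1..6 repeatedly uses only 6 of the 36 pairs -> large statistic
stat = serial_pair_statistic(list(range(1, 7)) * 60 + [1])
```

Note that a uniform-per-face sequence can still fail this test, which is exactly the point: it detects joint-frequency bias that the single-value chi-squared test cannot see.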
6. Shannon Entropy
What it tests: How much information content is in the output?
How it works:
- Calculate frequency of each face: p_i = count_i / total
- Compute entropy: H = -Σ p_i × log₂(p_i)
- Compare to theoretical maximum
Theoretical values:
Maximum entropy for 6-sided die:
H_max = log₂(6) ≈ 2.585 bits
This occurs when all faces are equally likely (p = 1/6)
What it detects:
- Low entropy = predictable output
- Concentrated distribution = fewer effective outcomes
- Information loss from bias
Interpretation:
| Observed Entropy | Meaning |
|---|---|
| ~2.585 bits | Perfect — all outcomes equally likely |
| 2.4 - 2.58 bits | Good — minor variation |
| < 2.4 bits | Concerning — some outcomes dominate |
| < 2.0 bits | Serious — significant bias present |
Note: Entropy is always shown as a metric, not a pass/fail test, because it gives an intuitive single-number summary of randomness quality.
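The entropy computation can be sketched as (illustrative, stdlib only):

```python
import math
from collections import Counter

def shannon_entropy(rolls):
    """Shannon entropy, in bits, of the empirical face distribution."""
    counts = Counter(rolls)
    total = len(rolls)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A perfectly uniform sample achieves the maximum log2(6) ~ 2.585 bits
uniform = list(range(1, 7)) * 100
entropy = shannon_entropy(uniform)
```

A degenerate source that always returns the same face scores 0 bits, the bottom of the table above.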
7. Autocorrelation
What it tests: Is there correlation between values at different time lags?
How it works:
- For each lag k (1, 2, 3, ..., 20), compute the correlation between the sequence and itself shifted by k
- Check whether the correlations fall within expected bounds
Expected behavior:
For truly random data:
- Autocorrelation at all lags ≈ 0
- 95% confidence bounds: ±1.96/√n
For n = 1000:
- Bounds ≈ ±0.062
- Values outside bounds suggest correlation
What it detects:
- Periodic patterns (every 10th value repeats)
- Trending behavior
- Poor PRNG with short cycles
Important caveat: Die values (1-6) are categorical, not continuous. Autocorrelation assumes numeric distance matters, but "1 → 6" isn't meaningfully different from "2 → 3" in randomness terms.
Interpretation:
- Use for visualization and trend detection
- Don't rely on it as primary randomness test
- Transition matrix is more appropriate for sequential independence
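The lag-k autocorrelation and its 95% white-noise bound can be sketched as (illustrative, stdlib only; a production version would vectorize this):

```python
import math

def autocorrelation(values, lag):
    """Sample autocorrelation of a numeric sequence at the given lag."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    cov = sum((values[t] - mean) * (values[t + lag] - mean)
              for t in range(n - lag)) / n
    return cov / var

def confidence_bound(n):
    """Approximate 95% bound for the autocorrelation of white noise."""
    return 1.96 / math.sqrt(n)

# A sequence that repeats with period 6 shows near-perfect lag-6 correlation
r6 = autocorrelation([1, 2, 3, 4, 5, 6] * 100, lag=6)
```

For random data, roughly 1 in 20 lags will poke just outside the ±1.96/√n band by chance, which is another reason this is treated as a visualization aid rather than a primary test.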
Rolling Windows
Tests run on multiple window sizes to catch both short-term and long-term anomalies:
| Window | Purpose | Update Frequency |
|---|---|---|
| 100 | Short-term fluctuations | Every round |
| 1,000 | Medium-term patterns | Every round |
| 10,000 | Stable statistics | Every 10 rounds |
| 100,000 | Long-term validation | Every 100 rounds |
Why multiple windows?
- Small windows: Responsive to recent changes, but noisy
- Large windows: Stable statistics, but slow to detect new problems
- Combined view: Best of both worlds
Bonferroni Correction
When running multiple tests, false positives accumulate.
Problem:
- 6 tests at α = 0.05
- Probability of at least one false positive: 1 - (0.95)⁶ ≈ 26%
Solution: Bonferroni correction
- Adjusted α = 0.05 / 6 ≈ 0.0083
- Each test must achieve p < 0.0083 to be "significant"
- Family-wise error rate stays at or below 5%
Dashboard shows:
- Individual test p-values
- Bonferroni-corrected overall status
- Whether any test is "significant" after correction
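The correction can be sketched as below (illustrative; the function name is hypothetical, not the dashboard's API):

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which tests remain significant after Bonferroni correction."""
    adjusted = alpha / len(p_values)      # 0.05 / 6 ~ 0.0083 for six tests
    return [p < adjusted for p in p_values]

# p = 0.03 would "fail" at alpha = 0.05 but does not survive the correction;
# p = 0.002 is significant either way
flags = bonferroni_significant([0.03, 0.6, 0.002, 0.2, 0.5, 0.9])
```

The adjusted threshold is what keeps the family-wise false-positive rate near 5% despite the ~26% chance (1 − 0.95⁶) that at least one of six uncorrected tests fails by chance.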
What These Tests Cannot Tell You
Statistical tests have limitations:
- Cannot prove randomness — only detect specific types of non-randomness
- Cannot detect all manipulation — adversary might pass all tests
- Cannot predict future values — that's the point
- Will occasionally fail — 5% false positive rate is expected
The dashboard is evidence of quality, not proof of perfection.
Further Reading
- Benchmark Tests — Comparing rng.dev against drand and NIST beacons
- NIST SP 800-22 — Statistical test suite for random number generators
- Diehard Tests — Classic battery of randomness tests
- How It Works — Beacon generation process
- Threat Model — Security assumptions and limitations