Benchmark Tests
This page explains the statistical tests used to compare rng.dev against two established randomness beacons: drand and the NIST Randomness Beacon.
Key insight: If rng.dev's output is statistically indistinguishable from these gold-standard beacons, you can trust it for the same use cases.
What We're Comparing
Each beacon publishes a fixed-size hash per round (256 bits for rng.dev and drand; 512 bits for NIST). We compare the statistical properties of these hash bytes across beacons.
| Beacon | Output | Cadence |
|---|---|---|
| rng.dev | 256-bit SHA3-256 hash | 1 second |
| drand | 256-bit BLS signature hash | 3 seconds (quicknet) |
| NIST | 512-bit hash | 60 seconds |
For comparison, we:
- Collect N rounds from each beacon (e.g., 1,000 rounds)
- Extract the hash bytes from each round
- Run identical statistical tests on each beacon's output
- Compare the resulting p-values
If all beacons show similar p-values, their outputs are statistically indistinguishable under these tests.
The Four Benchmark Tests
1. Kolmogorov-Smirnov (K-S) Test
What it tests: Do the hash bytes follow a uniform distribution?
How it works:
- Extract all bytes from N rounds (N × 32 bytes for 256-bit hashes)
- Build the empirical cumulative distribution function (ECDF)
- Compare ECDF to the theoretical uniform distribution (0-255)
- Calculate the maximum deviation (K-S statistic)
Why it matters:
A good random hash should produce bytes uniformly distributed across 0-255. Any deviation suggests bias in the underlying generation process.
Perfect uniform distribution:
┌────────────────────────────────────┐
│ ████████████████████████████████ │
│ ████████████████████████████████ │
│ ████████████████████████████████ │
└────────────────────────────────────┘
0 255
Biased distribution (would fail K-S):
┌────────────────────────────────────┐
│ ██████████████████████ │
│ ████████████████ │
│ ████████████ │
└────────────────────────────────────┘
0 128 255
Interpretation:
- p > 0.05: Byte distribution is consistent with uniform random
- p < 0.01: Statistically significant deviation from uniform
Example result:
K-S Test Results (1,000 rounds):
Beacon | D-statistic | p-value | Status
----------|-------------|---------|--------
rng.dev | 0.0089 | 0.89 | PASS
drand | 0.0092 | 0.87 | PASS
NIST | 0.0098 | 0.84 | PASS
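As a quick illustration, the K-S check above can be sketched with NumPy and SciPy. The seeded PRNG here is just a stand-in for real beacon bytes:

```python
import numpy as np
from scipy import stats

# Stand-in for beacon output: 1,000 rounds x 32 bytes from a seeded PRNG.
rng = np.random.default_rng(42)
sample = rng.integers(0, 256, size=32_000)

# Compare the empirical CDF to the continuous uniform on [0, 256).
# Byte values are discrete, so this is an approximation, but the
# discreteness error (~1/256) is small at this sample size.
d_stat, p_value = stats.kstest(sample, "uniform", args=(0, 256))
```

For genuinely uniform bytes, `d_stat` stays well below the 5% critical value of roughly 1.36/sqrt(n).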
2. Chi-Squared Test
What it tests: Do all byte values (0-255) appear with equal frequency?
How it works:
- Count occurrences of each byte value (0-255) across all rounds
- Compare observed counts to expected counts (total_bytes / 256)
- Calculate chi-squared statistic: χ² = Σ (observed - expected)² / expected
- Convert to p-value using chi-squared distribution (255 degrees of freedom)
Why it matters:
The K-S test checks the overall shape of the distribution. Chi-squared checks whether specific byte values are over- or under-represented.
For 1,000 rounds × 32 bytes = 32,000 bytes:
- Expected per value: 32,000 / 256 = 125 occurrences
- Each value should appear ~125 times (standard deviation ≈ 11, so 95% of counts fall within roughly ±22)
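The expected-count arithmetic above can be checked directly with `scipy.stats.chisquare`; the seeded PRNG again stands in for beacon bytes:

```python
import numpy as np
from scipy import stats

# Stand-in for 1,000 rounds x 32 bytes of beacon output.
rng = np.random.default_rng(7)
sample = rng.integers(0, 256, size=32_000)

# Observed count of each byte value vs. the expected 125 per value.
observed = np.bincount(sample, minlength=256)
expected = len(sample) / 256           # 32,000 / 256 = 125.0
chi2_stat, p_value = stats.chisquare(observed, [expected] * 256)
```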
Interpretation:
- p > 0.05: All byte values appear with expected frequency
- p < 0.01: Some bytes appear too often or too rarely
Example result:
Chi-Squared Test (1,000 rounds):
Beacon | χ² statistic | p-value | Status
----------|--------------|---------|--------
rng.dev | 248.3 | 0.92 | PASS
drand | 251.7 | 0.94 | PASS
NIST | 246.1 | 0.91 | PASS
3. Runs Test
What it tests: Do sequences of increasing/decreasing bytes occur at the expected rate?
How it works:
- Compare each byte to the next: is it larger (+) or smaller (-)?
- Count "runs" — consecutive sequences of same direction
- Compare run count to expected value for random sequences
- Calculate p-value using normal approximation
Why it matters:
Even if individual bytes are uniformly distributed, they might follow patterns. The runs test detects:
- Too much alternation (up-down-up-down)
- Too much momentum (up-up-up-up)
- Hidden sequential structure
Example byte sequence: [42, 88, 91, 67, 45, 78, 234, 12]
Directions: + + - - + + -
Runs: |──1──|──2──|──3──|─4─|   (4 runs: ++, --, ++, -)
Expected runs for n bytes: (2n - 1) / 3
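Counting the runs in the example sequence above takes only a few lines:

```python
import numpy as np

seq = np.array([42, 88, 91, 67, 45, 78, 234, 12])
signs = np.sign(np.diff(seq))                      # [+1, +1, -1, -1, +1, +1, -1]
runs = 1 + int(np.sum(signs[:-1] != signs[1:]))    # a new run starts at each sign change
expected = (2 * len(seq) - 1) / 3                  # (2*8 - 1) / 3 = 5.0
# runs == 4, expected == 5.0
```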
Interpretation:
- p > 0.05: Run count is normal for random data
- p < 0.01: Sequence has abnormal patterns
Example result:
Runs Test (1,000 rounds):
Beacon | Observed Runs | Expected | p-value | Status
----------|---------------|----------|---------|--------
rng.dev | 21,287 | 21,333 | 0.71 | PASS
drand | 21,198 | 21,333 | 0.68 | PASS
NIST | 21,156 | 21,333 | 0.67 | PASS
4. Serial Correlation
What it tests: Is there correlation between consecutive bytes?
How it works:
- Form all consecutive byte pairs (b[i], b[i+1])
- Calculate the Pearson correlation coefficient across all pairs
- Test whether correlation is significantly different from zero
- Convert to p-value using t-distribution
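The steps above reduce to a single call to `scipy.stats.pearsonr` on the byte array against a shifted copy of itself; the seeded PRNG is a stand-in for beacon bytes:

```python
import numpy as np
from scipy import stats

# Stand-in for beacon bytes; a good generator shows ~zero correlation.
rng = np.random.default_rng(3)
sample = rng.integers(0, 256, size=32_000).astype(float)

# Pearson correlation between each byte and its successor.
corr, p_value = stats.pearsonr(sample[:-1], sample[1:])
```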
Why it matters:
Perfect random data has zero correlation between consecutive values. Serial correlation detects:
- Linear predictability (knowing byte N helps predict byte N+1)
- Lagged dependencies
- Poor mixing in the hash function
Zero correlation (ideal):
byte[i+1]
│ · · · · ·
│ · · · · · ·
│ · · · · · ·
└─────────────────→ byte[i]
(random scatter, no pattern)
Positive correlation (bad):
byte[i+1]
│ · · ·
│ · · ·
│ · · ·
└─────────────────→ byte[i]
(larger bytes followed by larger bytes)
Interpretation:
- p > 0.05: No significant correlation (good)
- p < 0.01: Consecutive bytes are correlated (bad)
Example result:
Serial Correlation (1,000 rounds):
Beacon | Correlation | p-value | Status
----------|-------------|---------|--------
rng.dev | -0.0021 | 0.82 | PASS
drand | 0.0034 | 0.77 | PASS
NIST | -0.0028 | 0.79 | PASS
How to Read the Results
The benchmark table shows p-values for each test. Here's how to interpret them:
| P-Value Range | Color | Meaning |
|---|---|---|
| > 0.10 | Green | Strong pass — well within expected range |
| 0.05 - 0.10 | Green | Pass — acceptable |
| 0.01 - 0.05 | Yellow | Borderline — worth monitoring |
| < 0.01 | Red | Investigate — statistically unusual |
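A hypothetical helper that mirrors this table (the function name is illustrative, not a published API; the thresholds are exactly the table's):

```python
def classify_p_value(p: float) -> tuple[str, str]:
    """Map a p-value to the color/status convention used in the table above."""
    if p > 0.10:
        return ("green", "strong pass")
    if p >= 0.05:
        return ("green", "pass")
    if p >= 0.01:
        return ("yellow", "borderline")
    return ("red", "investigate")
```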
Key points:
- Similar p-values across beacons = rng.dev is statistically equivalent
- Occasional low p-values are normal — by definition, about 5% of tests fall below p = 0.05 purely by chance
- All beacons should behave similarly — if one fails and others pass, investigate
Why Compare Against drand and NIST?
| Beacon | Why It's a Gold Standard |
|---|---|
| drand | Threshold BLS signatures from 20+ independent operators; mathematically provable randomness |
| NIST | US government standard; hardware-sourced entropy; decades of cryptographic research |
If rng.dev's statistical properties match these established beacons, you can trust it for equivalent use cases. The comparison provides empirical evidence that our blockchain-derived randomness is as good as purpose-built randomness beacons.
Sample Size Considerations
The benchmark table lets you select different sample sizes:
| Sample Size | Statistical Power | Best For |
|---|---|---|
| 100 rounds | Low — can miss subtle bias | Quick sanity check |
| 1,000 rounds | Medium — catches most issues | Standard monitoring |
| 10,000 rounds | High — detects subtle patterns | Deep analysis |
| 100,000 rounds | Very high — rigorous validation | Publication-quality claims |
Larger samples provide more statistical power but take longer to collect. The default (1,000 rounds) balances responsiveness with reliability.
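To see why sample size matters, here is a sketch (synthetic data, not beacon output) that forces roughly 0.5% of bytes to zero — a subtle bias — and runs the chi-squared test at two sample sizes. The bias slips past the small sample but is unmistakable in the large one:

```python
import numpy as np
from scipy import stats

def biased_bytes(n: int, seed: int) -> np.ndarray:
    """Uniform bytes, except ~0.5% are forced to zero (a subtle bias)."""
    rng = np.random.default_rng(seed)
    sample = rng.integers(0, 256, size=n)
    sample[rng.random(n) < 0.005] = 0
    return sample

def chi2_p(sample: np.ndarray) -> float:
    observed = np.bincount(sample, minlength=256)
    return stats.chisquare(observed).pvalue   # uniform expected counts by default

p_small = chi2_p(biased_bytes(100 * 32, seed=1))       # 100 rounds: bias hides in noise
p_large = chi2_p(biased_bytes(10_000 * 32, seed=1))    # 10,000 rounds: bias is obvious
```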
Technical Implementation
For those implementing their own comparison:
```python
import numpy as np
from scipy import stats


def compare_beacons(rng_hashes: list[bytes],
                    drand_hashes: list[bytes],
                    nist_hashes: list[bytes]) -> dict:
    """
    Compare three beacons using standard statistical tests.
    Each hash is a 32-byte (256-bit) value.
    """
    results = {}
    for name, hashes in [('rng', rng_hashes),
                         ('drand', drand_hashes),
                         ('nist', nist_hashes)]:
        # Flatten all bytes into one array of values 0-255
        all_bytes = np.array([b for h in hashes for b in h])

        # 1. K-S Test: compare to the continuous uniform on [0, 256).
        # (Byte values are discrete, so this is an approximation; the
        # discreteness error is small at these sample sizes.)
        ks_stat, ks_p = stats.kstest(all_bytes, 'uniform', args=(0, 256))

        # 2. Chi-Squared: byte value frequencies vs. uniform expectation
        observed = np.bincount(all_bytes, minlength=256)
        expected = len(all_bytes) / 256
        chi2_stat, chi2_p = stats.chisquare(observed, [expected] * 256)

        # 3. Runs Test: sequential up/down patterns
        runs_p = runs_test(all_bytes)

        # 4. Serial Correlation between consecutive bytes
        corr, corr_p = stats.pearsonr(all_bytes[:-1], all_bytes[1:])

        results[name] = {
            'ks': {'stat': ks_stat, 'p': ks_p},
            'chi2': {'stat': chi2_stat, 'p': chi2_p},
            'runs': {'p': runs_p},
            'serial': {'corr': corr, 'p': corr_p},
        }
    return results


def runs_test(data: np.ndarray) -> float:
    """Wald-Wolfowitz runs-up-and-down test for randomness."""
    # Count runs of increasing/decreasing values
    diffs = np.diff(data)
    signs = np.sign(diffs)
    signs = signs[signs != 0]  # Remove ties
    runs = 1 + np.sum(signs[:-1] != signs[1:])
    n = len(signs) + 1  # number of observations (one more than the diffs)

    # Expected runs and variance for a random sequence
    expected = (2 * n - 1) / 3
    variance = (16 * n - 29) / 90

    # Z-score and two-sided p-value via normal approximation
    z = (runs - expected) / np.sqrt(variance)
    p = 2 * (1 - stats.norm.cdf(abs(z)))
    return p
```
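As a self-contained smoke test, three of the four checks can be run on synthetic stand-in data — here SHA3-256 of a counter plays the role of each beacon's per-round hash (the runs test can be added the same way):

```python
import hashlib
import numpy as np
from scipy import stats

def synthetic_hashes(label: str, n_rounds: int) -> list[bytes]:
    """Deterministic 32-byte stand-ins for one beacon's rounds."""
    return [hashlib.sha3_256(f"{label}:{i}".encode()).digest()
            for i in range(n_rounds)]

p_values = {}
for name in ("rng", "drand", "nist"):
    b = np.frombuffer(b"".join(synthetic_hashes(name, 1_000)), dtype=np.uint8)
    _, ks_p = stats.kstest(b, "uniform", args=(0, 256))
    _, chi2_p = stats.chisquare(np.bincount(b, minlength=256))
    _, corr_p = stats.pearsonr(b[:-1].astype(float), b[1:].astype(float))
    p_values[name] = {"ks": ks_p, "chi2": chi2_p, "serial": corr_p}
```

All three synthetic "beacons" should produce unremarkable p-values, just as the real beacons do in the tables above.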
Relationship to Die Value Tests
| Test Type | Target | Documented In |
|---|---|---|
| Benchmark tests (this page) | 256-bit hash bytes | Comparing beacons |
| Die value tests | 1-6 derived values | Statistical Tests |
The benchmark tests validate the underlying hash quality. The die value tests validate the derived output used for visualization. Both should pass for a well-functioning beacon.
Further Reading
- Statistical Tests — Tests for die value output
- How It Works — Beacon generation process
- NIST SP 800-22 — Statistical test suite for RNGs
- drand Documentation — Threshold randomness beacon