Every wallet VIGIL scores gets a single letter (A–F) plus a 0–100 score, computed from six dimensions, penalized for red flags, and, critically, published alongside a bootstrap 95% confidence interval. If the CI spans 20 points or more (wider than a 15-point grade bucket), or the wallet has fewer than 30 resolved bets, the grade is reported as INS (Insufficient Data) rather than a letter.
| Dimension | Weight | What it measures |
|---|---|---|
| Calibration | 25% | How closely stated probabilities match observed outcomes (Brier Skill Score + calibration error). |
| Live Edge | 20% | Unrealized PnL trajectory on open positions. |
| Profitability | 20% | Risk-adjusted realized PnL. |
| Consistency | 15% | Return stability: low coefficient of variation on per-bet returns. |
| Discipline | 10% | Position sizing, diversification, no concentration risk. |
| Sample Size | 10% | Resolved-bet count (logarithmic). |
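As a rough illustration of how the weights in the table combine, the sketch below assumes each dimension has already been reduced to a 0–100 sub-score and that any red-flag penalty is subtracted after weighting. Both are assumptions, and the names `WEIGHTS` and `composite_score` are hypothetical rather than VIGIL's actual code.

```python
# Dimension weights copied from the table above; they sum to 1.0.
WEIGHTS = {
    "calibration": 0.25,
    "live_edge": 0.20,
    "profitability": 0.20,
    "consistency": 0.15,
    "discipline": 0.10,
    "sample_size": 0.10,
}


def composite_score(dims: dict, penalty: float = 0.0) -> float:
    """Weighted sum of 0-100 sub-scores minus a penalty, clamped to [0, 100].

    Assumption: sub-scores are pre-scaled to 0-100 and the penalty is applied
    after weighting. The source text does not spell out either detail.
    """
    missing = WEIGHTS.keys() - dims.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    raw = sum(WEIGHTS[name] * dims[name] for name in WEIGHTS)
    return max(0.0, min(100.0, raw - penalty))
```

Because calibration carries a quarter of the weight, a 10-point swing in that sub-score moves the composite by 2.5 points under these assumptions.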
For each resolved bet, we compare the trader's revealed probability to the outcome:
Brier = (1/N) · Σ (impliedProb_i − outcome_i)²
BSS = 1 − Brier / Brier_climatology
Brier_climatology = base_rate · (1 − base_rate)
BSS > 0 means the trader beats a naive "always predict the base rate" forecaster. BSS < 0 means they're worse than doing nothing. We also decompose Brier into Reliability, Resolution, and Uncertainty (Murphy, 1973) to distinguish miscalibrated traders from traders who just bet near 50%.
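The Brier and BSS formulas above translate directly into a few lines of code. This is a minimal sketch: it assumes the base rate is the fraction of the wallet's own resolved bets that paid out (the text does not say which population the base rate is taken over), and the function names are illustrative.

```python
def brier_score(implied_probs, outcomes):
    """Mean squared error between implied probabilities and 0/1 outcomes."""
    if len(implied_probs) != len(outcomes) or not outcomes:
        raise ValueError("need two equal-length, non-empty sequences")
    total = sum((p - o) ** 2 for p, o in zip(implied_probs, outcomes))
    return total / len(outcomes)


def brier_skill_score(implied_probs, outcomes):
    """BSS = 1 - Brier / Brier_climatology.

    Assumption: base_rate is computed from this wallet's own outcomes.
    Returns None when every outcome is identical, because the climatology
    Brier is then zero and the ratio is undefined.
    """
    base_rate = sum(outcomes) / len(outcomes)
    climatology = base_rate * (1.0 - base_rate)
    if climatology == 0.0:
        return None
    return 1.0 - brier_score(implied_probs, outcomes) / climatology
```

For example, implied probabilities `[0.8, 0.3, 0.6, 0.1]` against outcomes `[1, 0, 1, 0]` give a Brier of 0.075 and, with a base rate of 0.5, a BSS of 0.7.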
The reference probability against which we score each bet is the market mid-price at the trader's entry timestamp, not the final resolution price. Using resolution would be circular — the trader's own position moves the price. Using entry-time mid-price measures genuine forecasting skill relative to the crowd's available information at the moment of the bet.
We compute a 95% confidence interval on the composite trust score via non-parametric bootstrap:
1. For each wallet, extract per-bet squared errors e_i = (impliedProb_i − outcome_i)²
2. Resample {e_i} with replacement, N = n_bets, 1000 iterations
3. Recompute mean Brier on each resample
4. Project ΔBrier → ΔCalibrationDim → ΔScore using calibration-dim weight
5. Return 2.5 / 97.5 percentiles as scoreLow / scoreHigh
6. Map score band to grade band via standard grade thresholds
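Steps 1 through 5 can be sketched with the standard library alone. The mapping from Brier to the calibration sub-score is not given in the text, so `brier_to_calibration_dim` below is a hypothetical linear placeholder; the helper names and the `seed` parameter are likewise assumptions added for illustration.

```python
import random


def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    if not sorted_vals:
        raise ValueError("empty input")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def brier_to_calibration_dim(brier):
    """HYPOTHETICAL mapping: Brier 0.25 (coin flip) -> 0, Brier 0.0 -> 100."""
    return max(0.0, min(100.0, 100.0 * (1.0 - brier / 0.25)))


def bootstrap_score_ci(sq_errors, point_score, calib_weight=0.25,
                       iterations=1000, seed=None):
    """Return (scoreLow, scoreHigh) for the composite score.

    Only the calibration dimension is resampled; every other dimension
    stays at its point estimate, as the text describes.
    """
    n = len(sq_errors)
    if n == 0:
        raise ValueError("no resolved bets")
    rng = random.Random(seed)
    point_dim = brier_to_calibration_dim(sum(sq_errors) / n)
    scores = []
    for _ in range(iterations):
        resample = rng.choices(sq_errors, k=n)   # step 2: with replacement
        brier = sum(resample) / n                # step 3: mean Brier
        delta_dim = brier_to_calibration_dim(brier) - point_dim  # step 4
        score = point_score + calib_weight * delta_dim
        scores.append(max(0.0, min(100.0, score)))
    scores.sort()
    return percentile(scores, 2.5), percentile(scores, 97.5)  # step 5
```

Passing a fixed `seed` makes the interval reproducible between runs, which is useful when comparing two scorings of the same wallet.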
Other dimensions are held at point estimate — calibration is the largest variance driver and the one any critic will attack first.
A wallet's grade is reported as INS (not a letter) when either:
| Trigger | Threshold | Why |
|---|---|---|
| Sample too small | resolvedBets < 30 | Below n=30, bootstrap CI on Brier is wider than the grade-width; a letter misrepresents precision. |
| CI spans grade | scoreHigh − scoreLow ≥ 20 | One grade bucket is 15 points. A 20+ span means we can't distinguish D from B at 95%. |
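The two triggers in the table reduce to a single predicate. A minimal sketch, with hypothetical constant and function names:

```python
MIN_RESOLVED_BETS = 30   # "Sample too small" threshold from the table
MAX_CI_SPAN = 20.0       # "CI spans grade" threshold from the table


def insufficient_data(resolved_bets: int, score_low: float, score_high: float) -> bool:
    """True when the grade should be reported as INS instead of a letter."""
    too_few_bets = resolved_bets < MIN_RESOLVED_BETS
    ci_too_wide = (score_high - score_low) >= MAX_CI_SPAN
    return too_few_bets or ci_too_wide
```

Either condition alone is enough: a wallet with 500 resolved bets still reports INS if its interval runs from 50 to 70.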
This is a deliberate choice to bias toward honesty. Early VIGIL will show many INS wallets. That is correct.
| Grade | Score | Tier |
|---|---|---|
| A | 80–100 | SHARP: demonstrated calibration, positive BSS, profitable. |
| B | 65–79 | SOLID: net positive skill. |
| C | 50–64 | DEVELOPING: mixed signals. |
| D | 35–49 | RISKY: below naïve baseline or heavy red flags. |
| F | 0–34 | DANGER: net negative signal. |
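The grade table maps onto a simple threshold lookup. The sketch below assumes fractional scores fall into the band whose lower bound they have reached (so 79.5 is a B), since the table only lists integer ranges; `GRADE_FLOORS`, `score_to_grade`, and `grade_band` are illustrative names.

```python
# Lower bound of each band, highest first, taken from the grade table.
GRADE_FLOORS = [(80, "A"), (65, "B"), (50, "C"), (35, "D"), (0, "F")]


def score_to_grade(score: float) -> str:
    """Map a 0-100 score to its letter grade."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    for floor, grade in GRADE_FLOORS:
        if score >= floor:
            return grade
    raise AssertionError("unreachable: 0 is the lowest floor")


def grade_band(score_low: float, score_high: float) -> tuple:
    """Map a score interval to (gradeLow, gradeHigh)."""
    return score_to_grade(score_low), score_to_grade(score_high)
```

An interval of 48 to 66 maps to `("D", "B")`, which is the kind of spread the INS rule is meant to catch.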
Certain patterns deduct from the raw dimension score:
- Penny-lottery: wallets with ≥50% of bets at sub-$0.10 stakes get a 15pt penalty (25pt if ≥80% are penny bets). The penalty is reduced to zero if BSS > 0: if the strategy works, it's not farming.
- Receive-only: zero-outbound wallets are lightly flagged (5pt). Many Polymarket proxy/settlement addresses are legitimate.
- Penny + Receive-only + BSS < 0: hard-capped at D (49). Only applies when all three conditions hold.

Before any letter grade is shown, every wallet passes through an anti-Sybil gate that weights grade-eligibility by how expensive the wallet was to create. The objection we hear most is "what stops someone from spinning up 100 wallets and farming an A?" The math below is the answer.
| Condition | Outcome |
|---|---|
| Wallet age < 30 days (Basescan-verified) | Force-INS regardless of bet count. |
| Wallet age > 1000 days with < 5 total txs | Force-INS (dormant-proxy pattern). |
| Basescan unavailable + resolvedBets ≥ 100 | Eligible for grade, consensus-weight factor 0.70 (penalized for age-unknown). |
| Basescan unavailable + resolvedBets < 100 | Force-INS (insufficient PoW to distinguish from a fresh Sybil). |
| Otherwise | Eligible; consensus-weight factor scales log between 30 and 730 days, saturating at 1.0. |
ageFactor(ageDays) = (log(ageDays) − log(30)) / (log(730) − log(30))
clamped to [0, 1]
The factor multiplies into the consensus weight on top of √stake × exp(−days/30) × gradeWeight, so a new Sybil would need real USDC at real risk over real time to move the skill-weighted consensus — not just more wallets. Every report returns an antiSybil block with {eligibleForGrade, ageFactor, reason, ageDaysUsed} so the gate decision is inspectable for every wallet we publish.
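The gate table, the `ageFactor` formula, and the consensus-weight product can be sketched together. Several details here are assumptions: `age_days=None` stands in for "Basescan unavailable", the `reason` strings are invented, the `ageFactor` value returned for force-INS wallets is a guess, and `days` in the decay term is taken to mean days since the bet.

```python
import math
from typing import Optional


def age_factor(age_days: float) -> float:
    """Log-scaled factor: 0 at 30 days, 1 at 730 days, clamped to [0, 1]."""
    if age_days <= 0:
        return 0.0
    raw = (math.log(age_days) - math.log(30)) / (math.log(730) - math.log(30))
    return max(0.0, min(1.0, raw))


def anti_sybil_gate(age_days: Optional[float], total_txs: int,
                    resolved_bets: int) -> dict:
    """Apply the gate table row by row; reason strings are hypothetical."""
    def block(eligible, factor, reason):
        return {"eligibleForGrade": eligible, "ageFactor": factor,
                "reason": reason, "ageDaysUsed": age_days}

    if age_days is None:  # Basescan unavailable
        if resolved_bets >= 100:
            return block(True, 0.70, "age_unknown")
        return block(False, None, "age_unknown_low_sample")
    if age_days < 30:
        return block(False, age_factor(age_days), "wallet_too_new")
    if age_days > 1000 and total_txs < 5:
        return block(False, age_factor(age_days), "dormant_proxy")
    return block(True, age_factor(age_days), "ok")


def consensus_weight(stake: float, days: float, grade_weight: float,
                     factor: float) -> float:
    """sqrt(stake) * exp(-days/30) * gradeWeight * ageFactor."""
    return math.sqrt(stake) * math.exp(-days / 30.0) * grade_weight * factor
```

Under this formula a one-year-old wallet gets a factor of roughly 0.78, so its vote counts for about three quarters of what an otherwise identical two-year-old wallet's would.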
A VIGIL grade is not:
If you are a Polymarket trader and do not want your wallet graded publicly, email api@vigilscore.xyz with a signed message from the wallet. We'll exclude it from public display (aggregate anonymized stats may still include it).
Scoring code is deterministic given the same input data. The commit hash of the running version is published at /v1/health. Every score response includes a scoredAt timestamp so you can re-run the same wallet later and see what changed.
GET /v1/polymarket/:wallet
Response includes trustScore, trustGrade, full calibrationReport, and the bootstrap confidence.ci95 block with {scoreLow, scoreHigh, gradeLow, gradeHigh, insufficientData}.
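A minimal client sketch for this endpoint follows. It assumes the response is JSON, that `confidence.ci95` is a nested object as the dotted name suggests, and that no authentication is required; the base URL is not stated in this document, so the caller must supply it, and the function names are hypothetical.

```python
import json
import urllib.request


def fetch_report(base_url: str, wallet: str, timeout: float = 10.0) -> dict:
    """GET {base_url}/v1/polymarket/{wallet} and decode the JSON body.

    Assumption: the API host is passed in by the caller because the
    document does not name it.
    """
    url = f"{base_url.rstrip('/')}/v1/polymarket/{wallet}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize(report: dict) -> str:
    """One-line summary: grade (or INS), score, and the 95% score interval."""
    ci = report["confidence"]["ci95"]
    label = "INS" if ci["insufficientData"] else report["trustGrade"]
    return f"{label} {report['trustScore']} [{ci['scoreLow']}, {ci['scoreHigh']}]"
```

Checking `insufficientData` before displaying `trustGrade` keeps a client from showing a letter that the interval does not support.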
Polymarket's data-api has a ~3100 trade pagination ceiling per wallet. For very high-volume wallets we sample rather than exhaust. Resolution timestamps are inferred from market close dates; in future versions we'll cross-reference the CLOB log for exact entry time.
We do not currently detect wash-trading within a wallet's own history. Sybil clustering across wallets is in v1.23.