VIGIL
Grading · Methodology
// what this grade means

How VIGIL grades a Polymarket trader.

Every wallet VIGIL scores gets a single letter (A–F) plus a 0–100 score, computed from six dimensions and penalized for red flags. Critically, each grade is published alongside a bootstrap 95% confidence interval (CI95). If the interval spans 20 or more points (wider than one 15-point grade bucket), or the wallet has fewer than 30 resolved bets, the grade is reported as INS (Insufficient Data) rather than a letter.

The honest version. A letter grade asserts certainty. A letter grade without a confidence interval is a marketing choice pretending to be statistics. We publish every grade with its CI95 so you can see when two traders are statistically distinguishable and when they aren't.

The six dimensions

Dimension      Weight  What it measures
Calibration    25%     How closely stated probabilities match observed outcomes (Brier Skill Score + calibration error).
Live Edge      20%     Unrealized PnL trajectory on open positions.
Profitability  20%     Risk-adjusted realized PnL.
Consistency    15%     Return stability: low coefficient of variation on per-bet returns.
Discipline     10%     Position sizing, diversification, no concentration risk.
Sample Size    10%     Resolved-bet count (logarithmic).
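
As a rough illustration of how the weights combine, here is a minimal Python sketch. It assumes each dimension has already been scored on a 0–100 scale and that the composite is a plain weighted sum; the function name, dictionary keys, and that 0–100 convention are illustrative assumptions, not VIGIL's published code. Penalties are left out because their values are not listed here.

```python
# Weights copied from the dimension table; they sum to 1.0.
WEIGHTS = {
    "calibration": 0.25,
    "live_edge": 0.20,
    "profitability": 0.20,
    "consistency": 0.15,
    "discipline": 0.10,
    "sample_size": 0.10,
}


def composite_score(dims: dict) -> float:
    """Weighted sum of per-dimension scores (assumed to be on a 0-100 scale)."""
    missing = WEIGHTS.keys() - dims.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    raw = sum(WEIGHTS[name] * dims[name] for name in WEIGHTS)
    # Clamp defensively in case an input dimension is out of range.
    return max(0.0, min(100.0, raw))
```

Because calibration carries a quarter of the weight, a 10-point swing in that dimension alone moves the composite by 2.5 points.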

Calibration: Brier Skill Score

For each resolved bet, we compare the trader's revealed probability to the outcome:

Brier = (1/N) · Σ (impliedProb_i − outcome_i)²
BSS   = 1 − Brier / Brier_climatology
Brier_climatology = base_rate · (1 − base_rate)

BSS > 0 means the trader beats a naive "always predict the base rate" forecaster. BSS < 0 means they're worse than doing nothing. We also decompose Brier into Reliability, Resolution, and Uncertainty (Murphy, 1973) to distinguish miscalibrated traders from traders who just bet near 50%.
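
The formulas above translate directly into a few lines of Python. This is a sketch only: the function names are made up for illustration, outcomes are assumed to be coded as 0 or 1, and the base rate is assumed to be computed over the same set of bets being scored.

```python
def brier_score(implied_probs, outcomes):
    """Mean squared difference between implied probability and the 0/1 outcome."""
    if not implied_probs or len(implied_probs) != len(outcomes):
        raise ValueError("need two non-empty sequences of equal length")
    n = len(implied_probs)
    return sum((p - o) ** 2 for p, o in zip(implied_probs, outcomes)) / n


def brier_skill_score(implied_probs, outcomes):
    """BSS = 1 - Brier / Brier_climatology, with the base rate taken from outcomes."""
    base_rate = sum(outcomes) / len(outcomes)
    climatology = base_rate * (1 - base_rate)
    if climatology == 0:
        # Every outcome was identical, so the reference Brier is 0 and BSS is undefined.
        raise ValueError("BSS undefined when all outcomes are the same")
    return 1 - brier_score(implied_probs, outcomes) / climatology
```

For example, implied probabilities of 0.8, 0.3, 0.6, and 0.1 against outcomes 1, 0, 1, 0 give a Brier of 0.075 against a climatology of 0.25, so BSS is 0.7.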

Reference forecast

The reference probability against which we score each bet is the market mid-price at the trader's entry timestamp, not the final resolution price. Using resolution would be circular — the trader's own position moves the price. Using entry-time mid-price measures genuine forecasting skill relative to the crowd's available information at the moment of the bet.

Bootstrap CI95 on the score

We compute a 95% confidence interval on the composite trust score via non-parametric bootstrap:

1. For each wallet, extract per-bet squared errors e_i = (impliedProb_i − outcome_i)²
2. Resample {e_i} with replacement, N = n_bets, 1000 iterations
3. Recompute mean Brier on each resample
4. Project ΔBrier → ΔCalibrationDim → ΔScore using calibration-dim weight
5. Return 2.5 / 97.5 percentiles as scoreLow / scoreHigh
6. Map score band to grade band via standard grade thresholds

Other dimensions are held at point estimate — calibration is the largest variance driver and the one any critic will attack first.
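
The steps above can be sketched in Python as follows. Step 4 depends on how Brier maps to the calibration dimension, which is not specified on this page, so `brier_to_calibration_dim` below is a hypothetical linear map used purely for illustration. The function names and the interpolating percentile helper are likewise assumptions.

```python
import random

CALIBRATION_WEIGHT = 0.25  # calibration weight from the dimension table


def brier_to_calibration_dim(brier):
    """HYPOTHETICAL map: Brier 0 -> 100, Brier 0.25 (coin flip) or worse -> 0."""
    return max(0.0, min(100.0, 100.0 * (1.0 - brier / 0.25)))


def percentile(sorted_values, q):
    """Linear-interpolation percentile of an already sorted list, q in [0, 100]."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def bootstrap_ci95(sq_errors, point_score, iterations=1000, seed=None):
    """Return (scoreLow, scoreHigh) by resampling per-bet squared errors."""
    if not sq_errors:
        raise ValueError("need at least one resolved bet")
    rng = random.Random(seed)
    n = len(sq_errors)
    point_dim = brier_to_calibration_dim(sum(sq_errors) / n)
    scores = []
    for _ in range(iterations):
        resample = rng.choices(sq_errors, k=n)  # sample with replacement
        delta_dim = brier_to_calibration_dim(sum(resample) / n) - point_dim
        # Only the calibration dimension varies; the others stay at point estimate.
        score = point_score + CALIBRATION_WEIGHT * delta_dim
        scores.append(max(0.0, min(100.0, score)))
    scores.sort()
    return percentile(scores, 2.5), percentile(scores, 97.5)
```

Passing a fixed `seed` makes the interval reproducible for a given set of errors, which fits the determinism goal stated under Reproducibility.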

The "Insufficient Data" rule

A wallet's grade is reported as INS (not a letter) when either:

Trigger           Threshold                  Why
Sample too small  resolvedBets < 30          Below n=30, bootstrap CI on Brier is wider than the grade-width; a letter misrepresents precision.
CI spans grade    scoreHigh − scoreLow ≥ 20  One grade bucket is 15 points. A 20+ span means we can't distinguish D from B at 95%.

This is a deliberate choice to bias toward honesty. Early VIGIL will show many INS wallets. That is correct.
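
The two triggers reduce to a single boolean check. A minimal sketch, with the thresholds taken from the table and a function name chosen for illustration:

```python
MIN_RESOLVED_BETS = 30  # "Sample too small" threshold
MAX_CI_SPAN = 20.0      # "CI spans grade" threshold, in score points


def insufficient_data(resolved_bets, score_low, score_high):
    """True when the wallet should be reported as INS instead of a letter."""
    too_few_bets = resolved_bets < MIN_RESOLVED_BETS
    ci_too_wide = (score_high - score_low) >= MAX_CI_SPAN
    return too_few_bets or ci_too_wide
```

Either condition alone is enough: a wallet with 500 resolved bets still reports INS if its interval runs from 50 to 70.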

Grade thresholds (point estimate)

Grade  Score   Tier
A      80–100  SHARP: demonstrated calibration, positive BSS, profitable.
B      65–79   SOLID: net positive skill.
C      50–64   DEVELOPING: mixed signals.
D      35–49   RISKY: below naïve baseline or heavy red flags.
F      0–34    DANGER: net negative signal.
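
A small lookup is enough to turn a score into a letter. The sketch below assumes each band is closed at its lower bound, so a fractional score such as 79.5 falls in B; how VIGIL treats fractional scores is not stated here, and the function name is illustrative.

```python
# (lower bound, grade) pairs from the table, highest band first.
GRADE_FLOORS = [(80, "A"), (65, "B"), (50, "C"), (35, "D"), (0, "F")]


def score_to_grade(score):
    """Map a 0-100 score to its letter grade."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for floor, grade in GRADE_FLOORS:
        if score >= floor:
            return grade
    return "F"  # unreachable for valid input; kept as a safe default
```

Applying the same function to scoreLow and scoreHigh gives the grade band implied by a confidence interval.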

Penalties

Certain patterns deduct from the raw dimension score:

Anti-Sybil difficulty adjustment

Before any letter grade is shown, every wallet passes through an anti-Sybil gate that weights grade-eligibility by how expensive the wallet was to create. The objection we hear most is "what stops someone from spinning up 100 wallets and farming an A?" — the math below is the answer.

Condition                                  Outcome
Wallet age < 30 days (Basescan-verified)   Force-INS regardless of bet count.
Wallet age > 1000 days with < 5 total txs  Force-INS (dormant-proxy pattern).
Basescan unavailable + resolvedBets ≥ 100  Eligible for grade, consensus-weight factor 0.70 (penalized for age-unknown).
Basescan unavailable + resolvedBets < 100  Force-INS (insufficient PoW to distinguish from a fresh Sybil).
Otherwise                                  Eligible; consensus-weight factor scales logarithmically with wallet age between 30 and 730 days, saturating at 1.0.

ageFactor(ageDays) = (log(ageDays) − log(30)) / (log(730) − log(30))
                    clamped to [0, 1]

The factor multiplies into the consensus weight on top of √stake × exp(−days/30) × gradeWeight, so a new Sybil would need real USDC at real risk over real time to move the skill-weighted consensus — not just more wallets. Every report returns an antiSybil block with {eligibleForGrade, ageFactor, reason, ageDaysUsed} so the gate decision is inspectable for every wallet we publish.
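
The gate and the weight formula can be sketched together in Python. Several details below are assumptions made for illustration: `age_days=None` stands for "Basescan unavailable", the `reason` strings are invented, force-INS wallets are given an ageFactor of 0.0, and `days` in the decay term is read as days since the bet.

```python
import math


def age_factor(age_days):
    """Log-scaled factor: 0.0 at 30 days, 1.0 at 730 days and beyond."""
    if age_days <= 0:
        return 0.0
    raw = (math.log(age_days) - math.log(30)) / (math.log(730) - math.log(30))
    return max(0.0, min(1.0, raw))


def anti_sybil_gate(age_days, total_txs, resolved_bets):
    """Return an antiSybil-style block; reason strings are hypothetical."""
    def block(eligible, factor, reason):
        return {"eligibleForGrade": eligible, "ageFactor": factor,
                "reason": reason, "ageDaysUsed": age_days}

    if age_days is None:  # wallet age could not be verified
        if resolved_bets >= 100:
            return block(True, 0.70, "age_unknown_high_volume")
        return block(False, 0.0, "age_unknown_low_volume")
    if age_days < 30:
        return block(False, 0.0, "wallet_too_new")
    if age_days > 1000 and total_txs is not None and total_txs < 5:
        return block(False, 0.0, "dormant_proxy")
    return block(True, age_factor(age_days), "ok")


def consensus_weight(stake_usdc, days_since_bet, grade_weight, age_factor_value):
    """sqrt(stake) * exp(-days/30) * gradeWeight * ageFactor."""
    return (math.sqrt(stake_usdc) * math.exp(-days_since_bet / 30)
            * grade_weight * age_factor_value)
```

Because the stake enters through a square root, splitting a stake across many wallets raises the sum of the √stake terms, and it is the age gate and the age factor that offset this: wallets younger than 30 days are not grade-eligible at all, and a freshly eligible wallet starts with a factor near 0.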

What grades are NOT

A VIGIL grade is not:

Opt-out

If you are a Polymarket trader and do not want your wallet graded publicly, email api@vigilscore.xyz with a signed message from the wallet. We'll exclude it from public display (aggregate anonymized stats may still include it).

Reproducibility

Scoring code is deterministic given the same input data. The commit hash of the running version is published at /v1/health. Every score response includes a scoredAt timestamp so you can re-run the same wallet later and see what changed.

Get the JSON

GET /v1/polymarket/:wallet

Response includes trustScore, trustGrade, full calibrationReport, and the bootstrap confidence.ci95 block with {scoreLow, scoreHigh, gradeLow, gradeHigh, insufficientData}.
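
A minimal client sketch in Python, using only the standard library. The API host is not given on this page, so `base_url` is a required parameter. The code also assumes that `trustGrade` sits at the top level of the response and that the interval lives under `confidence` → `ci95`, as the field names above suggest. Both function names are illustrative.

```python
import json
from urllib.request import urlopen


def fetch_report(base_url, wallet, timeout=10.0):
    """GET /v1/polymarket/:wallet and decode the JSON body."""
    url = f"{base_url.rstrip('/')}/v1/polymarket/{wallet}"
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def displayed_grade(report):
    """Return 'INS' when flagged, otherwise the letter with its CI95 score range."""
    ci = report["confidence"]["ci95"]
    if ci["insufficientData"]:
        return "INS"
    return f'{report["trustGrade"]} ({ci["scoreLow"]:.0f}-{ci["scoreHigh"]:.0f})'
```

Checking `insufficientData` before reading `trustGrade` keeps a client from displaying a letter that the methodology says should not be shown.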

Known limitations

Polymarket's data-api has a pagination ceiling of roughly 3100 trades per wallet, so for very high-volume wallets we score a sample rather than the full history. Resolution timestamps are inferred from market close dates; in future versions we'll cross-reference the CLOB log for exact entry time.

We do not currently detect wash-trading within a wallet's own history. Sybil clustering across wallets is in v1.23.