How it works

Methodology · the corpus, the sweep, the score

Last updated: 2026-05-16 · Version 0.1 (planning, not shipped)

Forward-looking. This page describes the methodology as designed. The current site renders fixture data while the sweep engine is built. Numbers shown on the home page and inside the app are illustrative, not measured. We are publishing the methodology before the engine ships so you can audit the math before you trust the output.

1. The corpus

A versioned, git-tracked set of finance-related questions — the kind of questions humans actually type into Google: "what stock should I buy?", "is bitcoin safe?", "who makes the GPUs?". The corpus at v1 is curated by hand and audited for spread (consumer assets and infrastructure assets in roughly equal weight). The target size for the production v1 sweep is between 200 and 1,000 questions; the landing page's "4,832" figure refers to the long-term v2 corpus.

Each question carries metadata: a category (Demand, Supply, or Linked), an asked-count signal derived from public search-trend data, and a free-form intent note explaining what the curator is testing for.
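As a rough illustration of the metadata described above, one corpus entry could be modelled as a small record. This is only a sketch: the field names, the integer type for the asked-count signal, the example values, and the Supply tag on the sample question are assumptions, not the published corpus schema.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    DEMAND = "Demand"
    SUPPLY = "Supply"
    LINKED = "Linked"


@dataclass(frozen=True)
class CorpusQuestion:
    # Hypothetical field names; the real schema may differ.
    text: str
    category: Category
    asked_count: int  # signal derived from public search-trend data
    intent_note: str  # free-form note on what the curator is testing for

    def __post_init__(self) -> None:
        if not self.text.strip():
            raise ValueError("question text must not be empty")
        if self.asked_count < 0:
            raise ValueError("asked_count must be non-negative")


q = CorpusQuestion(
    text="who makes the GPUs?",
    category=Category.SUPPLY,  # assumed tag for this sample question
    asked_count=1200,          # made-up value for illustration
    intent_note="Checks whether models name chip suppliers.",
)
```

Freezing the record keeps a question immutable once it is part of a versioned corpus, which fits the git-tracked design described above.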

2. The sweep

A sweep is a single pass of every question against every tracked model at a known version. Sweeps are batch, not real-time: they fire on model release events and on a weekly cadence. The model set today is GPT, Claude, Gemini, Llama, Grok, Mistral, and DeepSeek. We expand model coverage as new frontier models ship.

For each (question, model) pair we record the model's prose answer. A structured extraction pass — itself an LLM call constrained to JSON output — pulls a list of { ticker, confidence, rank } tuples from each answer. Tickers are disambiguated against a public exchange listing.
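The post-processing of one extraction result might look roughly like the sketch below. It assumes the extractor returns a top-level JSON array, that confidence lies in [0, 1], and that rank is 1-based; none of these details are confirmed by the text above, and the function and variable names are hypothetical.

```python
import json
from typing import NamedTuple


class Mention(NamedTuple):
    ticker: str
    confidence: float
    rank: int


def parse_extraction(raw_json: str, listed_tickers: set[str]) -> list[Mention]:
    """Parse extractor output and keep only well-formed, listed tickers."""
    mentions = []
    for item in json.loads(raw_json):
        ticker = str(item["ticker"]).upper()
        confidence = float(item["confidence"])
        rank = int(item["rank"])
        if ticker not in listed_tickers:
            continue  # not in the exchange listing: drop it
        if not 0.0 <= confidence <= 1.0 or rank < 1:
            continue  # outside the assumed value ranges: drop it
        mentions.append(Mention(ticker, confidence, rank))
    return sorted(mentions, key=lambda m: m.rank)


raw = (
    '[{"ticker": "nvda", "confidence": 0.9, "rank": 1},'
    ' {"ticker": "ZZZZ", "confidence": 0.8, "rank": 2},'
    ' {"ticker": "AMD", "confidence": 1.4, "rank": 3}]'
)
listed = {"NVDA", "AMD"}  # stand-in for a public exchange listing
mentions = parse_extraction(raw, listed)
# Only NVDA survives: ZZZZ is unlisted and AMD's confidence is out of range.
```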

3. The surfaces

For each question, the per-model answers are aggregated into a single ranking. Assets are ordered by the weighted mode across models: each model contributes exactly once per question, and ties are broken by the mean confidence of the extracted tuples.
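One way to read this aggregation rule is sketched below. It assumes each model casts a single, equally weighted vote for its top-ranked asset; the actual weighting scheme is not specified here, so treat the equal weights and the function name as placeholders.

```python
from collections import defaultdict


def aggregate_ranking(top_picks: dict[str, tuple[str, float]]) -> list[str]:
    """top_picks maps model name -> (top-ranked ticker, confidence)."""
    votes = defaultdict(int)
    confidences = defaultdict(list)
    for ticker, confidence in top_picks.values():
        votes[ticker] += 1  # each model contributes exactly once
        confidences[ticker].append(confidence)

    def sort_key(ticker: str):
        mean_conf = sum(confidences[ticker]) / len(confidences[ticker])
        # Most votes first; ties broken by higher mean confidence,
        # then alphabetically so the output is deterministic.
        return (-votes[ticker], -mean_conf, ticker)

    return sorted(votes, key=sort_key)


picks = {
    "model_a": ("NVDA", 0.9),
    "model_b": ("NVDA", 0.7),
    "model_c": ("AMD", 0.8),
    "model_d": ("TSM", 0.6),
}
ranking = aggregate_ranking(picks)
# ranking == ["NVDA", "AMD", "TSM"]
```

AMD and TSM each receive one vote, so the tie is resolved by confidence (0.8 against 0.6).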

Questions tagged Demand roll up into the Demand surface: a ranked list of assets that models say humans should buy. Questions tagged Supply roll up into the Supply surface: assets that enable the Demand surface to exist (chips, lithography, power, water, fabs, networks). An asset that appears on both surfaces gets a Linked badge.

4. Drift

Per question, per sweep pair (v_{n-1} → v_n), drift is the normalised rank shift of the top assets across models, weighted by the fraction of models that agreed on the prior ranking. Drift ranges from 0.00 (no change) to 1.00 (every model now disagrees with its prior self).

Drift is not directional — it does not say "asset X is now better". It says "models changed their mind."
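The exact drift formula is not given above, so the following is only one plausible reading. It assumes the rank shift is a footrule-style sum of absolute rank changes over each model's prior top-k list (with dropped assets treated as rank k + 1), normalised by the shift of a full replacement and capped at 1, and that "agreed on the prior ranking" means sharing the most common prior top-1 asset.

```python
from collections import Counter


def rank_shift(prior: list[str], current: list[str]) -> float:
    """Normalised shift of one model's prior top-k assets, in [0, 1]."""
    k = len(prior)
    if k == 0:
        return 0.0
    current_rank = {ticker: i + 1 for i, ticker in enumerate(current)}
    # Assets missing from the current list are treated as rank k + 1.
    total = sum(
        abs((i + 1) - current_rank.get(ticker, k + 1))
        for i, ticker in enumerate(prior)
    )
    full_replacement = k * (k + 1) / 2  # shift when every prior asset drops out
    return min(1.0, total / full_replacement)


def drift(prior: dict[str, list[str]], current: dict[str, list[str]]) -> float:
    """Both arguments map model name -> ranked ticker list for one question."""
    models = [m for m in prior if m in current and prior[m]]
    if not models:
        return 0.0
    top1_counts = Counter(prior[m][0] for m in models)
    prior_agreement = top1_counts.most_common(1)[0][1] / len(models)
    mean_shift = sum(rank_shift(prior[m], current[m]) for m in models) / len(models)
    return prior_agreement * mean_shift


before = {"model_a": ["NVDA", "AMD"], "model_b": ["NVDA", "AMD"]}
after = {"model_a": ["TSM", "ASML"], "model_b": ["INTC", "MU"]}
unchanged = drift(before, before)  # 0.0: nothing moved
replaced = drift(before, after)    # 1.0: full prior agreement, every list replaced
```

Under this reading the score is symmetric in direction: it reports how much the rankings moved, not which asset benefited.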

5. Fracture

A fracture is the moment the consensus that gave a question its ranking breaks. Concretely, a fracture occurs when the top-1 asset for a question is agreed on by fewer than 50% of the model set in the current sweep, after clearing that 50% threshold in the prior sweep. Fractures fire a notification to subscribed users.
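The fracture condition can be sketched as a comparison of top-1 agreement across two sweeps. The function names are hypothetical, and the sketch assumes "cleared" means agreement at or above 50% in the prior sweep.

```python
from collections import Counter


def top1_agreement(top1_by_model: dict[str, str]) -> float:
    """Fraction of models that share the most common top-1 asset."""
    if not top1_by_model:
        return 0.0
    count = Counter(top1_by_model.values()).most_common(1)[0][1]
    return count / len(top1_by_model)


def is_fracture(
    prior: dict[str, str], current: dict[str, str], threshold: float = 0.5
) -> bool:
    """True when agreement cleared the threshold before and falls below it now."""
    return top1_agreement(prior) >= threshold and top1_agreement(current) < threshold


prior_top1 = {"m1": "NVDA", "m2": "NVDA", "m3": "NVDA", "m4": "AMD"}
current_top1 = {"m1": "NVDA", "m2": "AMD", "m3": "TSM", "m4": "INTC"}
fractured = is_fracture(prior_top1, current_top1)
# Prior agreement is 3/4; current agreement is 1/4, so this is a fracture.
```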

6. The Prophecy Score

The Prophecy Score is a 0–100 composite per asset. The components, weights, and what each tries to measure:

Search persistence (25%) — long-term Google-trend stability for the asset's name. High values mean the asset has been on humans' minds for years, not weeks.
AI consensus (35%) — the fraction of model answers across the corpus that surface this asset. High values mean the conventional wisdom is consolidating.
Training footprint (25%) — the breadth of distinct questions in the corpus on which this asset appears in any model's answer. High values mean the asset is load-bearing across many narratives.
Momentum (15%) — short-term price and volume signal (24h delta normalised against a basket). High values mean the market is currently pricing in the consensus.

The score is computed nightly. Each component is percentile-ranked against the full asset universe so the score is bounded and stable across sweeps.
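A minimal sketch of the composite, using the weights listed above. The percentile definition (fraction of the asset universe at or below a value) and the input layout are assumptions made for illustration; the raw numbers are invented.

```python
from bisect import bisect_right

# Weights as listed in the component breakdown above.
WEIGHTS = {
    "search_persistence": 0.25,
    "ai_consensus": 0.35,
    "training_footprint": 0.25,
    "momentum": 0.15,
}


def percentile_ranks(values: dict[str, float]) -> dict[str, float]:
    """Map each asset to the fraction of the universe at or below its value."""
    ordered = sorted(values.values())
    n = len(ordered)
    return {asset: bisect_right(ordered, v) / n for asset, v in values.items()}


def prophecy_scores(raw: dict[str, dict[str, float]]) -> dict[str, float]:
    """raw maps component name -> {asset: raw value}; returns asset -> 0-100."""
    ranks = {name: percentile_ranks(raw[name]) for name in WEIGHTS}
    assets = raw["ai_consensus"].keys()
    return {
        a: round(100 * sum(WEIGHTS[c] * ranks[c][a] for c in WEIGHTS), 2)
        for a in assets
    }


raw_components = {  # invented values for two hypothetical assets
    "search_persistence": {"AAA": 0.9, "BBB": 0.2},
    "ai_consensus": {"AAA": 0.6, "BBB": 0.1},
    "training_footprint": {"AAA": 40, "BBB": 5},
    "momentum": {"AAA": 0.03, "BBB": -0.01},
}
scores = prophecy_scores(raw_components)
# AAA leads every component and scores 100.0; BBB sits at the 50th
# percentile of a two-asset universe on every component and scores 50.0.
```

Because each component is converted to a percentile before weighting, the raw units (search-trend indices, question counts, price deltas) never need to be comparable with one another.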

What it isn't: a price target, a forecast, or an alpha signal. A high Prophecy Score is a measurement of how inevitable an asset has become inside the training data. Inevitable is not the same as profitable. See Disclosures for the full caveat list.

7. What we measure vs. what we don't

We do measure model-to-model agreement, drift across model versions, and a coarse training-data footprint.
We don't measure whether the consensus is correct. The Service is a mirror, not a forecaster.
We don't personalise. The same surface is shown to every user.
We don't trade. There is no broker integration, no auto-rebalance, no order routing.

8. Reproducibility

Every sweep is timestamped and tagged with the model versions and corpus version used. Per-sweep results will be exportable on the Pro plan via the public API. We commit to publishing the corpus schema, the extraction prompt, and the aggregation formulas in this document as the engine ships.

9. Open problems

Training-data bias toward already-famous assets. We are exploring counter-weighting against a "recency in news" signal to surface emerging supply-side assets faster.
Provider-side prompt drift between model versions can fake a drift event. We try to fingerprint provider system prompts; this is best-effort.
Tickers that share names with common English words (e.g. HOOD, OPEN) produce extraction false-positives. Disambiguation has a confidence floor that biases toward exclusion.