Chapter 38 — Strategy validation: expectancy, sample size, and edge

Most traders never find out whether they have an edge. They run a method, have a good month and a bad month, adjust something after every losing streak, and never accumulate a clean enough sample to know if any version works. They mistake noise for signal, tinker themselves in circles, and conclude after two years that "trading is just hard."

This chapter is the antidote: how to tell, statistically, whether a gold strategy actually makes money — and how to avoid fooling yourself, which is the harder half.

Win rate is a vanity metric. Expectancy is the number.

A 70%-win-rate strategy can lose money. A 35%-win-rate strategy can mint it. What matters is expectancy — the average profit (in R, your initial risk unit) per trade:

Expectancy (R) = (Win% × AvgWin_R) − (Loss% × AvgLoss_R)

Worked example: a strategy wins 40% of the time, with an average winner of +2.5R and an average loser of −1.0R (entered into the formula as a magnitude, 1.0):

(0.40 × 2.5) − (0.60 × 1.0) = 1.0 − 0.6 = +0.40R per trade

Positive expectancy = an edge. That 40%-win-rate system makes +0.40R per trade on average, so over 100 trades it expects roughly +40R. Meanwhile a 70%-win-rate system with +0.5R winners and −2R losers:

(0.70 × 0.5) − (0.30 × 2.0) = 0.35 − 0.60 = −0.25R per trade

— bleeds money despite winning most of the time. Stop tracking win rate as your headline. Track expectancy. Win rate only matters as an input to it (and for the psychological reality that low-win-rate systems require enduring long losing streaks).
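The two worked examples can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, and it assumes the average loss is passed as a positive magnitude, as in the formula. The second helper shows that, on a real trade log, expectancy is simply the mean of the realised R multiples.

```python
def expectancy_r(win_rate: float, avg_win_r: float, avg_loss_r: float) -> float:
    """Expected R per trade; avg_loss_r is treated as a positive magnitude."""
    if not 0.0 <= win_rate <= 1.0:
        raise ValueError("win_rate must be between 0 and 1")
    loss_rate = 1.0 - win_rate
    return win_rate * avg_win_r - loss_rate * abs(avg_loss_r)


def expectancy_from_log(r_multiples: list[float]) -> float:
    """Same quantity from a trade log: the mean of the realised R multiples."""
    if not r_multiples:
        raise ValueError("need at least one closed trade")
    return sum(r_multiples) / len(r_multiples)


low_win_rate = expectancy_r(0.40, 2.5, 1.0)
high_win_rate = expectancy_r(0.70, 0.5, 2.0)
print(f"40% win rate: {low_win_rate:+.2f}R per trade")   # +0.40R
print(f"70% win rate: {high_win_rate:+.2f}R per trade")  # -0.25R
```

In practice the log-based version is the one to rely on, because it needs no separate estimates of win rate and average win or loss.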

The sample-size problem — why your last 10 trades tell you almost nothing

Here is the trap that ruins more strategies than any bad setup: judging a method on a tiny sample.

Suppose your true win rate is 50%. How often do you lose 5 in a row purely by chance? For any given run of five trades the probability is 0.5^5 ≈ 3.1%: uncommon, but it will happen repeatedly over a trading career. Over a few hundred trades you should expect several 5-loss streaks from a perfectly good 50% system. A trader who "fixes" the strategy after each streak is responding to noise and, worse, resetting the sample every time, which guarantees they never accumulate enough trades to know anything.
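The 3.1% figure applies to one specific run of five trades. The chance of meeting at least one such streak somewhere in a longer sample is much higher, and can be computed exactly with a small dynamic program. The sketch below assumes independent trades with a fixed win rate, which real trading only approximates; the function name is illustrative.

```python
def prob_losing_streak(n_trades: int, streak: int, win_rate: float) -> float:
    """Probability of at least one run of `streak` consecutive losses in n_trades."""
    loss_rate = 1.0 - win_rate
    # state[j] = probability that no full streak has occurred yet and the
    # current run of consecutive losses has length j (0 <= j < streak).
    state = [0.0] * streak
    state[0] = 1.0
    for _ in range(n_trades):
        nxt = [0.0] * streak
        nxt[0] = sum(state) * win_rate          # a win resets the run
        for j in range(streak - 1):
            nxt[j + 1] = state[j] * loss_rate   # a loss extends the run
        # Probability in state[streak - 1] followed by a loss completes the
        # streak and leaves the "no streak yet" states.
        state = nxt
    return 1.0 - sum(state)


for n in (5, 30, 100, 300):
    p = prob_losing_streak(n, streak=5, win_rate=0.5)
    print(f"{n:>3} trades: P(at least one 5-loss streak) = {p:.1%}")
```

With exactly five trades the function returns the 3.1% above. By 100 trades a 5-loss streak is more likely than not, which is why a single streak is poor evidence that a system has stopped working.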

Rough guidance on what a sample can tell you:

Closed trades	What you can conclude
< 10	Nothing. Pure noise. Do not draw conclusions, do not adjust.
10–20	A weak hint at direction; still dominated by variance.
30	The minimum to begin trusting an expectancy estimate (wide error bars).
100	A reasonably reliable expectancy read for the whole system.
30 per cell	What you need to compare sub-strategies (e.g. archetype × regime).

The discipline that follows: pick a version, run it untouched for at least 30 trades, then evaluate. Every mid-sample adjustment is a fresh experiment that throws away the trades before it. Tinkering feels like progress; it is the single most effective way to never know if you have an edge.
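To see why 30 trades comes with "wide error bars", it helps to put an interval around the expectancy estimate. The sketch below uses a plain normal approximation (mean ± 1.96 standard errors), which is rough for small or skewed samples and is an assumption of this example rather than a method the chapter specifies. The trade log is constructed, not real data: 40% winners at +2.5R and 60% losers at −1.0R.

```python
import math
import statistics


def expectancy_interval(r_multiples, z=1.96):
    """Mean R with an approximate 95% interval: mean +/- z * s / sqrt(n)."""
    n = len(r_multiples)
    if n < 2:
        raise ValueError("need at least two closed trades")
    mean = statistics.fmean(r_multiples)
    std_err = statistics.stdev(r_multiples) / math.sqrt(n)
    return mean, mean - z * std_err, mean + z * std_err


def example_log(n_trades):
    """Hypothetical log: 40% winners at +2.5R, 60% losers at -1.0R."""
    wins = round(0.40 * n_trades)
    return [2.5] * wins + [-1.0] * (n_trades - wins)


for n in (10, 30, 100, 400):
    mean, low, high = expectancy_interval(example_log(n))
    print(f"{n:>4} trades: {mean:+.2f}R, 95% interval {low:+.2f}R to {high:+.2f}R")
```

In this constructed log the measured expectancy is +0.40R at every sample size, yet the interval at 30 trades still spans zero; the lower bound only clears zero by 100 trades. The interval narrows with the square root of the sample size, so quadrupling the trades halves the uncertainty.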

The ways you fool yourself

Validation is mostly about not lying to yourself. The recurring traps:

Survivorship / cherry-picking. "Look how clean this setup was" — chosen after you knew it worked. The only honest test is forward: log the trade before the outcome, and grade every signal, winners and losers.
Overfitting. Tune enough parameters on past data and you'll fit the noise perfectly and the future not at all. The more knobs, the more suspicious a great backtest should make you.
Ignoring costs. A backtest on mid-price with no spread, slippage, or financing (Ch 35) routinely shows an edge that the live account doesn't have. Always net the costs.
Regime blindness. A strategy validated in a trending 2024 may be negative-expectancy in a ranging 2025. Tag every trade with the regime; an edge that only exists in one regime is a conditional edge, not a general one.
Recency bias. Weighting the last few trades far more than the prior fifty. The sample doesn't care which trades were recent.
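The cost trap is easy to quantify once costs are expressed in R. A minimal sketch, assuming a fixed round-trip cost per ounce (spread plus slippage plus financing) and the same stop distance on every trade; both figures below are hypothetical and chosen only for illustration.

```python
def cost_in_r(cost_per_oz: float, stop_distance_per_oz: float) -> float:
    """Round-trip cost as a fraction of the initial risk (1R)."""
    if stop_distance_per_oz <= 0:
        raise ValueError("stop distance must be positive")
    return cost_per_oz / stop_distance_per_oz


def net_expectancy_r(gross_r_multiples, cost_per_oz, stop_distance_per_oz):
    """Mean R after subtracting the per-trade cost from every trade."""
    cost_r = cost_in_r(cost_per_oz, stop_distance_per_oz)
    net = [r - cost_r for r in gross_r_multiples]
    return sum(net) / len(net)


# Hypothetical gross log: 40% winners at +2.5R, 60% losers at -1.0R (+0.40R gross).
gross = [2.5] * 4 + [-1.0] * 6

# Hypothetical costs: $0.40 per ounce round trip.
wide_stop = net_expectancy_r(gross, cost_per_oz=0.40, stop_distance_per_oz=8.0)
tight_stop = net_expectancy_r(gross, cost_per_oz=0.40, stop_distance_per_oz=2.0)
print(f"$8 stop: {wide_stop:+.2f}R net")   # cost is 0.05R per trade
print(f"$2 stop: {tight_stop:+.2f}R net")  # cost is 0.20R per trade
```

The same dollar cost takes a larger bite out of R when the stop is tight, so strategies with small stops are the ones most likely to show an edge on mid-price that disappears live.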

What a real validation loop looks like

Define the rules precisely — entry, stop, target, sizing — so the same setup is gradeable the same way every time.
Log every signal before the outcome — bias, archetype, regime, entry/stop/target. No retroactive editing.
Let the outcomes resolve mechanically — did it hit target or stop? Record the R, and the max adverse/favourable excursion.
Hold the rules fixed for ≥30 trades. No mid-sample changes.
Compute expectancy overall and per cell (archetype, regime, session). Find what actually pays.
Cut negative-expectancy cells, keep positive ones, then re-validate. One change at a time, each given its own clean sample.
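The per-cell step of this loop can be sketched as a small grouping routine. The `Trade` fields, the verdict labels, and the sample data are all hypothetical; the 30-trade threshold is the per-cell guideline from this chapter.

```python
from collections import defaultdict
from dataclasses import dataclass

MIN_TRADES_PER_CELL = 30  # the chapter's per-cell guideline


@dataclass(frozen=True)
class Trade:
    archetype: str
    regime: str
    r_multiple: float  # realised result in R, net of costs


def expectancy_by_cell(trades, min_trades=MIN_TRADES_PER_CELL):
    """Group by (archetype, regime); judge a cell only once it has enough trades."""
    cells = defaultdict(list)
    for t in trades:
        cells[(t.archetype, t.regime)].append(t.r_multiple)
    report = {}
    for cell, rs in sorted(cells.items()):
        n = len(rs)
        mean_r = sum(rs) / n
        if n < min_trades:
            verdict = "inconclusive"
        else:
            verdict = "keep" if mean_r > 0 else "cut"
        report[cell] = {"n": n, "expectancy_r": mean_r, "verdict": verdict}
    return report


# Constructed sample log, for illustration only.
trades = (
    [Trade("breakout", "trend", 2.5)] * 12 + [Trade("breakout", "trend", -1.0)] * 18
    + [Trade("breakout", "range", 2.5)] * 6 + [Trade("breakout", "range", -1.0)] * 24
    + [Trade("reversal", "range", 2.5)] * 5 + [Trade("reversal", "range", -1.0)] * 3
)
report = expectancy_by_cell(trades)
for cell, stats in report.items():
    print(cell, f"n={stats['n']}", f"{stats['expectancy_r']:+.2f}R", stats["verdict"])
```

Note the third cell: eight trades averaging well over +1R look like the best result in the log, but the routine refuses to rank it until the sample is large enough.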

This is slow. It is also the only thing that converts "I think this works" into "this works, and here's the number." Most traders won't do it, which is precisely why most traders don't have a measured edge.

Figure 38.1 — Noise masquerading as signal. Two equity curves from the same +0.3R-expectancy system, each 100 trades, differing only in random ordering. One shows a smooth rise; the other a 6-trade drawdown at trade 12 that would have made a tinkerer "fix" a perfectly good system. The point: a single path tells you little; the distribution is the truth.
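The effect in the figure is easy to reproduce. The sketch below is not the code behind Figure 38.1; it assumes a hypothetical payoff mix (52 winners at +1.5R and 48 losers at −1.0R, which averages +0.30R) and shuffles the same 100 outcomes into two orderings using arbitrary seeds.

```python
import random


def equity_curve(r_multiples):
    """Running total of R after each trade."""
    curve, total = [], 0.0
    for r in r_multiples:
        total += r
        curve.append(total)
    return curve


def max_drawdown(curve):
    """Largest peak-to-trough decline in R, measured from a starting equity of 0."""
    peak, worst = 0.0, 0.0
    for value in curve:
        peak = max(peak, value)
        worst = max(worst, peak - value)
    return worst


def longest_losing_streak(r_multiples):
    """Length of the longest run of consecutive losing trades."""
    longest = current = 0
    for r in r_multiples:
        current = current + 1 if r < 0 else 0
        longest = max(longest, current)
    return longest


def shuffled(r_multiples, seed):
    """Return a reordered copy; the set of outcomes is unchanged."""
    order = list(r_multiples)
    random.Random(seed).shuffle(order)
    return order


outcomes = [1.5] * 52 + [-1.0] * 48  # hypothetical +0.30R system, 100 trades
for seed in (1, 2):  # arbitrary seeds
    path = shuffled(outcomes, seed)
    curve = equity_curve(path)
    print(
        f"seed {seed}: final {curve[-1]:+.1f}R, "
        f"max drawdown {max_drawdown(curve):.1f}R, "
        f"longest losing streak {longest_losing_streak(path)}"
    )
```

Both orderings finish at the same total because they contain the same trades; only the drawdown and streak figures change. Running it over many seeds gives the distribution of paths, which is the thing a single equity curve cannot show.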

On goldintel today

The track-record system is this validation loop, automated. Every published brief is logged before its outcome; the outcome worker resolves it mechanically against real 1H bars; the stats aggregate expectancy by archetype, bias, and session. Two design choices encode this chapter directly: the per-cell stats stay marked "inconclusive" below ~10–30 samples (so you don't act on noise), and the weekly reflection only rewrites the strategy after enough closed trades to mean something. Read the track record the way this chapter prescribes — expectancy first, sample size always in mind, and no strategy changes mid-sample.

Common mistakes

Optimising for win rate instead of expectancy.
Drawing conclusions from < 30 trades — and adjusting the strategy on a losing streak that's pure variance.
Resetting the sample with every tweak, so a clean read never accumulates.
Backtesting on mid-price without spread/slippage/financing.
Validating in one regime and assuming the edge generalises.

Key takeaway

An edge is positive expectancy proven over a large-enough, cost-adjusted, regime-tagged sample — and the fastest way to never find yours is to keep tinkering before the sample is big enough to mean anything.