The role of machine learning in modern trading systems.

Machine learning has moved from academic curiosity to production-grade infrastructure. Where ML earns its keep — and where classical methods still win.

Where ML actually helps

Machine learning has been part of quantitative trading research for two decades. What has changed in the last seven years is that ML has moved from a research-only tool to a default component of production systems, along with a sharper sense of where it genuinely outperforms classical statistical methods and where it merely costs more to maintain.

Three areas have a strong evidence base. First, non-linear signal interaction: relationships between features that linear factor models capture poorly. Tree-based ensembles dominate this category. Second, regime detection: identifying which of several latent return-generating processes is currently active. Hidden Markov models (HMMs) and recurrent or transformer architectures dominate here. Third, structured-prediction problems on text and graph data (earnings-call sentiment, supply-chain risk, news-event extraction), where deep learning has comprehensively displaced rule-based and bag-of-words approaches.
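To make the regime-detection idea concrete, here is a minimal sketch of forward filtering in a two-state Gaussian hidden Markov model, using only the Python standard library. The regime means, volatilities, transition matrix and the toy return series are all hypothetical values chosen for illustration; in practice they would be estimated from data, and nothing here is meant to describe a specific production model.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def filter_regimes(returns, means, stds, transition, initial):
    """Forward-filter P(state_t | returns up to t) for a Gaussian HMM."""
    n_states = len(means)
    belief = list(initial)
    history = []
    for r in returns:
        # Predict: push the previous belief through the transition matrix.
        predicted = [
            sum(belief[i] * transition[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
        # Update: weight each state by how well it explains this return.
        weighted = [
            predicted[j] * gaussian_pdf(r, means[j], stds[j])
            for j in range(n_states)
        ]
        total = sum(weighted)
        belief = [w / total for w in weighted]
        history.append(belief)
    return history

# Hypothetical parameters: state 0 is "calm", state 1 is "turbulent".
MEANS = [0.0005, -0.001]
STDS = [0.005, 0.02]
TRANSITION = [[0.98, 0.02], [0.05, 0.95]]
INITIAL = [0.5, 0.5]

# Toy daily returns: five quiet days followed by five violent ones.
calm = [0.001, -0.002, 0.003, 0.000, -0.001]
turbulent = [-0.03, 0.025, -0.04, 0.035, -0.028]

posterior = filter_regimes(calm + turbulent, MEANS, STDS, TRANSITION, INITIAL)
p_turbulent_after_calm = posterior[len(calm) - 1][1]
p_turbulent_at_end = posterior[-1][1]
print(f"P(turbulent) after calm stretch: {p_turbulent_after_calm:.3f}")
print(f"P(turbulent) at end of sample:   {p_turbulent_at_end:.3f}")
```

The filter only uses information available up to each date, which is what makes the regime probability usable as a live input rather than a hindsight label.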

Outside these areas, ML is more often a capacity multiplier than an edge creator. It allows researchers to evaluate orders of magnitude more hypotheses than was previously feasible. The hypotheses that survive are usually classical in nature; the ML is the search engine, not the answer.

The signal-to-noise problem

The single defining feature of financial data — and the reason naive ML applications fail in finance — is the signal-to-noise ratio. In standard machine-learning benchmarks (image classification, language modelling), the signal-to-noise ratio is high: the cat is unambiguously a cat, the next word is one of a few plausible options. In financial returns, the signal is a tiny fraction of one percent of the variance; the rest is noise.

The implication is that ML techniques developed for high-SNR domains transfer poorly. Models that fit cleanly on ImageNet (large neural nets with millions of parameters and modest regularisation) overfit catastrophically on financial data. The same architecture, carried over with the same hyperparameters and trained on financial returns, will produce a beautiful in-sample Sharpe ratio and nothing out-of-sample. The mistake is to ignore the SNR difference.
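A toy simulation shows the failure mode. The sketch below generates synthetic data in which the signal explains roughly a quarter of one percent of the variance, then compares a maximally flexible model (a one-nearest-neighbour memoriser, standing in for any over-parameterised learner) with a one-parameter linear fit. The data-generating process, the coefficient of 0.05 and the sample sizes are assumptions made for illustration only.

```python
import random

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def make_data(rng, n, beta=0.05):
    """y = beta * x + unit-variance noise, so signal variance is about 0.25% of the total."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [beta * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

def fit_linear(xs, ys):
    """One-parameter least squares through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def predict_nearest(train_x, train_y, query_x):
    """1-nearest-neighbour: return the target of the closest training point."""
    predictions = []
    for q in query_x:
        idx = min(range(len(train_x)), key=lambda i: abs(train_x[i] - q))
        predictions.append(train_y[idx])
    return predictions

rng = random.Random(7)
train_x, train_y = make_data(rng, 400)
test_x, test_y = make_data(rng, 400)

# The memoriser reproduces its training targets exactly.
nn_in_sample = r_squared(train_y, predict_nearest(train_x, train_y, train_x))
nn_out_of_sample = r_squared(test_y, predict_nearest(train_x, train_y, test_x))

beta_hat = fit_linear(train_x, train_y)
linear_out_of_sample = r_squared(test_y, [beta_hat * x for x in test_x])

print(f"1-NN in-sample R^2:       {nn_in_sample:.3f}")
print(f"1-NN out-of-sample R^2:   {nn_out_of_sample:.3f}")
print(f"Linear out-of-sample R^2: {linear_out_of_sample:.4f}")
```

The memoriser's perfect in-sample fit is pure noise: out of sample it does worse than predicting the mean, while the one-parameter model stays close to the small true signal.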

The institutional response is twofold. First, aggressive regularisation — early stopping, strong L2, dropout, ensemble averaging, all simultaneously. Second, parsimony in architecture choice: prefer the smallest model that captures the hypothesised non-linearity. A tree ensemble with 500 trees and depth 4 is a stronger choice than a 50-million-parameter transformer for most return-prediction tasks, even when the transformer is technically more expressive. The transformer's expressive capacity becomes overfitting capacity in low-SNR domains.
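The capacity gap in that comparison can be put in rough numbers. The sketch below writes the 500-tree, depth-4 configuration as a parameter dictionary and counts an upper bound on its fitted leaf weights. The parameter names follow XGBoost's scikit-learn wrapper, but every value other than the tree count and depth is an illustrative assumption, and the leaf count ignores split thresholds, so treat it as an order-of-magnitude comparison only.

```python
# Parameter names follow XGBoost's scikit-learn wrapper. Only n_estimators
# and max_depth come from the text; the other values are assumptions.
CONSERVATIVE_PARAMS = {
    "n_estimators": 500,          # number of boosted trees
    "max_depth": 4,               # shallow trees limit interaction order
    "learning_rate": 0.02,        # assumed: small steps, many rounds
    "reg_lambda": 10.0,           # assumed: strong L2 penalty on leaf weights
    "subsample": 0.7,             # assumed: row subsampling per tree
    "colsample_bytree": 0.7,      # assumed: feature subsampling per tree
    "early_stopping_rounds": 50,  # assumed: stop when validation loss stalls
}

def max_leaf_values(n_trees, max_depth):
    """Upper bound on fitted leaf weights: a depth-d binary tree has at most 2**d leaves."""
    return n_trees * 2 ** max_depth

tree_capacity = max_leaf_values(
    CONSERVATIVE_PARAMS["n_estimators"], CONSERVATIVE_PARAMS["max_depth"]
)
transformer_params = 50_000_000
capacity_ratio = transformer_params / tree_capacity

print(f"Tree ensemble leaf weights (upper bound): {tree_capacity:,}")
print(f"Transformer parameters:                   {transformer_params:,}")
print(f"Ratio:                                    {capacity_ratio:,.0f}x")

# With XGBoost installed, the dictionary could be passed to the estimator,
# for example: xgboost.XGBRegressor(**CONSERVATIVE_PARAMS)
```

At most 8,000 leaf weights against 50 million parameters is a ratio of 6,250 to one, which is the sense in which expressive capacity becomes overfitting capacity when samples are scarce and noisy.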

Tree-based methods: the institutional workhorse

Gradient-boosted decision trees — XGBoost, LightGBM, CatBoost — are the institutional workhorse of quantitative ML. They handle non-linear interactions natively, are robust to outliers and missing data, train quickly enough to support the trial volume an institutional research team produces, and degrade gracefully when overfit.

Their strengths are several. Native handling of mixed feature types: macro indicators, technical indicators, fundamental ratios, and alt-data signals coexist in one model with little preprocessing. Native handling of non-linearity and interactions: no explicit feature engineering is required for product features, threshold features, or non-monotonic relationships. Interpretability via SHAP values: the contribution of each feature to each prediction is decomposable and reportable, which matters for governance.

Their weakness is one they share with all gradient methods: they are greedy about loss reduction and will fit noise eagerly if regularisation is loose. The institutional discipline is to start from conservative defaults (shallow trees of max depth 3–6, a modest leaf count of 32–128, an aggressive learning-rate schedule with early stopping) and then filter the survivors with trial-counted evaluation in the style of the Deflated Sharpe Ratio (DSR).
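Trial counting can be sketched with the Deflated Sharpe Ratio of Bailey and López de Prado, which asks how likely an observed Sharpe ratio is to exceed the best Sharpe expected from a given number of skill-less trials. The implementation below is a minimal version under stated assumptions: Sharpe ratios are per period (not annualised), kurtosis is the raw value (3 for a normal distribution), and the example inputs are invented for illustration rather than taken from any real strategy.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_STD_NORMAL = NormalDist()

def expected_max_sharpe(n_trials, sharpe_variance):
    """Approximate expected best Sharpe among n_trials strategies with no skill."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    q1 = _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    q2 = _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(sharpe_variance) * ((1.0 - EULER_GAMMA) * q1 + EULER_GAMMA * q2)

def deflated_sharpe(sharpe, n_obs, n_trials, sharpe_variance, skew=0.0, kurtosis=3.0):
    """Probability that the true Sharpe exceeds the best-of-n_trials benchmark."""
    benchmark = expected_max_sharpe(n_trials, sharpe_variance)
    # Standard error term adjusts for non-normal returns.
    denom = math.sqrt(1.0 - skew * sharpe + (kurtosis - 1.0) / 4.0 * sharpe ** 2)
    z = (sharpe - benchmark) * math.sqrt(n_obs - 1) / denom
    return _STD_NORMAL.cdf(z)

# Hypothetical inputs: per-period Sharpe of 0.08 over 1,000 observations,
# with a variance of 0.001 across the Sharpe ratios of all trials run.
dsr_few_trials = deflated_sharpe(0.08, n_obs=1000, n_trials=10, sharpe_variance=0.001)
dsr_many_trials = deflated_sharpe(0.08, n_obs=1000, n_trials=1000, sharpe_variance=0.001)

print(f"DSR after 10 trials:    {dsr_few_trials:.2f}")
print(f"DSR after 1,000 trials: {dsr_many_trials:.2f}")
```

The same backtest looks much less convincing once it is known to be the best of a thousand attempts rather than the best of ten, which is why the trial count has to be recorded honestly alongside every result.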

Our XGBoost risk predictor is a depth-6, 500-tree ensemble fit to forecast subsystem underperformance over a forward window. It is not a return predictor; it is a risk predictor, and that distinction matters. Predicting risk has a higher signal-to-noise ratio than predicting return — losses cluster more than gains do — and ML models perform measurably better when targeted at risk rather than return.

Deep learning: where it earns and where it doesn't

Deep learning earned its place in financial ML in three specific use cases. Natural-language processing. Transformer-based language models (FinBERT and successors) have displaced bag-of-words and rule-based approaches for sentiment scoring, event extraction, and topic detection on financial text. The improvement is large and broadly accepted.

Sequence modelling for high-frequency execution. LSTM and transformer architectures fit microsecond-scale order-book dynamics meaningfully better than ARMA-family models. The improvement is small in absolute Sharpe but operationally important at scale.

Image and structured-data fusion. Satellite-derived features and AIS-based logistics features can be processed end-to-end with classical CNNs and graph neural networks respectively. Even here, the pre-trained-feature approach (compute features once, then feed them into a tree ensemble) is usually competitive with end-to-end deep learning at lower cost and higher robustness.

Outside these areas, particularly in medium-frequency return prediction on price-derived features, deep learning has underperformed gradient boosting in most published studies and in our internal evaluation. The combination of low SNR, modest dataset size (even four decades of daily data, at roughly 252 trading days a year, is only ~10K samples per instrument), and high parameter counts produces overfitting that no amount of regularisation reliably solves. An honest evaluation, with proper trial counting and walk-forward testing, almost always favours simpler models.
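Walk-forward testing can be sketched as a generator of index splits in which every test block sits strictly after its training block. The rolling-window design, the optional embargo gap between train and test, and the specific window sizes below are assumptions chosen for illustration; an expanding window is an equally common choice.

```python
def walk_forward_splits(n_samples, train_size, test_size, embargo=0):
    """Yield (train_indices, test_indices) pairs for rolling walk-forward evaluation.

    Each test block starts `embargo` observations after its training block
    ends, so no test observation is ever seen during fitting.
    """
    start = 0
    while start + train_size + embargo + test_size <= n_samples:
        train_end = start + train_size
        test_start = train_end + embargo
        yield range(start, train_end), range(test_start, test_start + test_size)
        start += test_size  # roll the window forward by one test block

# Roughly four decades of daily data at 252 trading days per year.
n_samples = 252 * 40

# Hypothetical settings: ten years of training, one year of testing,
# and a five-day embargo between the two.
splits = list(walk_forward_splits(n_samples, train_size=252 * 10,
                                  test_size=252, embargo=5))

print(f"Total samples: {n_samples}")
print(f"Number of walk-forward folds: {len(splits)}")
first_train, first_test = splits[0]
print(f"First fold trains on [{first_train.start}, {first_train.stop}) "
      f"and tests on [{first_test.start}, {first_test.stop})")
```

Even with forty years of history the procedure yields only a few dozen independent out-of-sample years, which is a useful reminder of how little evidence there is to support a high-capacity model.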

When classical methods still win

Three classes of problem remain dominated by classical methods, despite ML's broader advance.

Linear factor models for cross-sectional return prediction. The Fama-French and successor factor models, fit by simple OLS or shrinkage regression, remain competitive with, and often superior to, ML approaches when the prediction target is the cross-sectional ranking of expected returns. The reasons are well understood: the underlying signal is largely linear, the data is cross-sectionally correlated in well-behaved ways, and the simplicity of the model is itself a strong regularising prior.

Mean-reversion and pair-trading signals. Cointegration-based pair selection and Ornstein-Uhlenbeck mean-reversion models are robust, theoretically grounded, and not improved by ML in most settings. A Z-score-based mean-reversion signal with a vol-adjusted threshold continues to deliver in many regimes despite being mathematically primitive.
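As an illustration of how little machinery such a signal needs, here is a minimal sketch of a Z-score mean-reversion rule. It interprets "vol-adjusted threshold" as an entry level expressed in units of trailing standard deviation, which is an assumption on our part; the window length, the entry level of 2.0 and the toy spread series are likewise hypothetical.

```python
from statistics import mean, stdev

def zscore_signal(spread, window, entry_z=2.0):
    """Return +1 (buy), -1 (sell) or 0 (flat) for each observation.

    The z-score compares the current spread with the mean and standard
    deviation of the previous `window` observations only, so the signal
    never looks ahead. Because the threshold is in standard-deviation
    units, it widens automatically when the spread becomes more volatile.
    """
    signals = []
    for t in range(len(spread)):
        if t < window:
            signals.append(0)  # not enough history yet
            continue
        history = spread[t - window:t]  # excludes the current observation
        sigma = stdev(history)
        if sigma == 0:
            signals.append(0)  # flat history carries no information
            continue
        z = (spread[t] - mean(history)) / sigma
        if z > entry_z:
            signals.append(-1)  # spread stretched high: bet on a fall
        elif z < -entry_z:
            signals.append(1)   # spread stretched low: bet on a rise
        else:
            signals.append(0)
    return signals

# Toy spread: oscillates quietly, then spikes up and crashes down.
spread = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 5.0, -5.0]
signals = zscore_signal(spread, window=8)
print(signals)
```

In this toy series the spike is sold and the subsequent crash is bought, and there is no fitted parameter beyond the window and the threshold.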

Risk modelling at the factor level. Multi-factor risk models (APT-family, Barra-style) fit by OLS or maximum likelihood remain the institutional standard for portfolio-level risk attribution. ML alternatives improve precision at the margin but introduce interpretability and governance issues that outweigh the gain in most institutional contexts.

The pattern is consistent: where the underlying relationship is approximately linear, classical methods are not just adequate but superior. ML's edge is non-linearity, and committing to an ML approach where no non-linearity exists is paying complexity cost without buying complexity benefit.

Production ML: drift, retraining, governance

Research ML and production ML are different disciplines. A model that performs well in research can fail in production for reasons unrelated to the model itself: data-feed changes, distributional drift, infrastructure latency, governance breaks. The institutional ML stack must address each.

Data-distribution drift is the most common production failure. The relationships a model learned from 2018–2022 data may not hold in 2024–2026. The institutional response is twofold: monitoring the drift in input feature distributions in real time, and retraining on a defined cadence (monthly or quarterly for medium-frequency strategies). A model that has not been retrained in twelve months is a failure waiting to happen.
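One simple way to monitor input drift is to compare each feature's live distribution with its training distribution using the two-sample Kolmogorov-Smirnov statistic. The sketch below is one possible approach rather than a description of any particular monitoring stack; the feature names, the synthetic samples and the alert threshold of 0.2 are hypothetical.

```python
from bisect import bisect_right

def ks_statistic(reference, live):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(live)
    gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for v in ref + cur:
        f_ref = bisect_right(ref, v) / len(ref)
        f_cur = bisect_right(cur, v) / len(cur)
        gap = max(gap, abs(f_ref - f_cur))
    return gap

def drifted_features(reference, live, threshold=0.2):
    """Names of features whose KS statistic exceeds the alert threshold."""
    return [
        name for name in reference
        if ks_statistic(reference[name], live[name]) > threshold
    ]

# Synthetic example: "momentum" has shifted upward in production,
# while "carry" is unchanged from the training window.
training_window = {
    "momentum": [i / 100 for i in range(100)],
    "carry": [i / 50 for i in range(100)],
}
live_window = {
    "momentum": [i / 100 + 0.5 for i in range(100)],
    "carry": [i / 50 for i in range(100)],
}

alerts = drifted_features(training_window, live_window)
print("Features flagged for drift:", alerts)
```

A check like this is cheap enough to run on every feature at every retraining decision point, and a flagged feature is a prompt to investigate before the model's live performance degrades.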

Feature-pipeline integrity. The features that go into the model in production must be exactly the features the model was trained on. A subtle change in vendor data format, a timezone shift, a missing-data convention change — any of these can degrade live performance even though the model itself is unchanged. Production-grade feature pipelines are versioned, tested end-to-end, and monitored for distributional drift on every feature.
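A lightweight guard for this is to fingerprint the feature schema at training time and refuse to score if the live schema differs. The sketch below is a minimal version under assumed conventions: the schema fields (name, dtype, timezone) and the example features are hypothetical, and a real pipeline would typically also version the transformation code and check value distributions.

```python
import hashlib
import json

def schema_fingerprint(features):
    """Hash an ordered list of feature descriptors into a short, stable fingerprint.

    Feature order is preserved, so reordering columns changes the fingerprint.
    """
    canonical = json.dumps(features, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

def assert_same_schema(training_fingerprint, live_features):
    """Raise if the live feature schema does not match the one used in training."""
    live_fingerprint = schema_fingerprint(live_features)
    if live_fingerprint != training_fingerprint:
        raise ValueError(
            f"feature schema changed: trained on {training_fingerprint}, "
            f"live pipeline produces {live_fingerprint}"
        )

# Hypothetical schema recorded when the model was trained.
training_features = [
    {"name": "close_to_close_return", "dtype": "float64", "tz": "UTC"},
    {"name": "realised_vol_20d", "dtype": "float64", "tz": "UTC"},
]
trained_on = schema_fingerprint(training_features)

# An identical live schema passes silently.
assert_same_schema(trained_on, list(training_features))

# A vendor-side timezone change is caught before any prediction is made.
shifted_features = [dict(f) for f in training_features]
shifted_features[0]["tz"] = "America/New_York"
try:
    assert_same_schema(trained_on, shifted_features)
    schema_change_detected = False
except ValueError as err:
    schema_change_detected = True
    print("Blocked:", err)
```

Storing the fingerprint alongside the model weights ties each model version to the exact inputs it expects, which also supports the audit trail discussed under governance.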

Governance. ML predictions enter risk systems and trading systems. The interpretability of those predictions and the auditability of the model's behaviour are not academic concerns; they are operational requirements. SHAP-based attribution, scenario testing, and version control over model weights are all part of an institutional production stack. A model that makes good predictions but cannot be explained or audited is unusable in regulated workflows.

The hybrid approach in practice

The institutional state of the art is not 'ML' or 'classical' but a hybrid stack where each layer uses the technique best suited to it. Signal generation uses a mix of linear factor models for cross-sectional alpha, mean-reversion and momentum for time-series alpha, and tree-based ensembles for non-linear interaction terms. Portfolio construction uses multi-objective evolutionary algorithms rather than ML — the optimisation problem is not a learning problem. Risk monitoring uses ML for non-linear regime detection and classical factor models for attribution.

The principle is to match technique to problem, not to commit to a single technique across the stack. ML is the right answer for some sub-problems and the wrong answer for others, and a research culture that defaults to ML for everything will systematically overfit and underperform a culture that picks the right tool for each job.

Across our four live strategies, ML touches three layers. The signal layer uses XGBoost ensembles for several non-linear interaction features but leans on classical factor models for the bulk of cross-sectional ranking. The risk layer uses gradient boosting for subsystem underperformance prediction. The execution layer uses small neural networks for short-horizon order-book dynamics in the FX strategies. The other layers — portfolio optimisation, risk attribution, drawdown control — are intentionally classical. The result is a hybrid stack where ML is a precision tool, not a default, and the strategy is more robust for it.
