Alternative data sources transforming investment decisions.
Beyond price and volume: how satellite imagery, shipping data, and NLP-derived sentiment are being folded into institutional decision-making, and where they fall short.
The alt-data spectrum
Twenty years ago, investment data meant prices, volumes, fundamentals, and earnings transcripts. The alternative-data revolution has broadened that definition to something closer to "any signal correlated with future cash flows". Estimates put institutional spending on alt-data above $5 billion per year as of 2025, growing at 25%–30% annually.
The spectrum is wide. Geospatial data — satellite imagery, weather, atmospheric chemistry — feeds commodity, retail, and macroeconomic models. Transactional data — credit-card panels, e-receipt aggregators, point-of-sale feeds — feeds revenue nowcasts. Behavioural data — web traffic, app downloads, search trends — feeds demand forecasts. Textual data — news, filings, social media, earnings calls, analyst reports — feeds NLP-driven sentiment, event extraction, and topic monitoring. Logistical data — AIS ship positions, flight tracking, freight rates — feeds trade-flow models and supply-chain analytics.
Each category has heroes and villains. The strategic question is not "which alt data has alpha?" but "which alt data has alpha for our strategy, after costs, and after that alpha is shared with the rest of the market?" Most alt-data vendors have many institutional clients; the resulting signal is rarely as proprietary as the sales pitch suggests.
Satellite imagery: real applications and over-promised ones
Satellite-derived signals have a long-running place in commodity research and a more recent place in equity research. The classic applications are real and well-validated: oil-tank fill levels at major terminals (Cushing, Saldanha Bay) feed crude inventory nowcasts; parking-lot car counts at major US retail chains feed quarterly revenue nowcasts; night-light intensity across emerging-market regions feeds GDP-growth proxies.
Each has a long live track record and a clear theoretical mechanism. None is exclusive — the data is sold to many institutional clients, and the alpha decay since first publication has been steady. A signal that produced a 0.5 Sharpe contribution in 2015 may produce a 0.1 contribution today after broad adoption. The alpha that remains is largely in the speed-of-update and the quality of pre-processing rather than in the raw data.
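As a rough worked example of the decay arithmetic, the sketch below computes the half-life implied by a fall from 0.5 to 0.1. It assumes exponential decay and a ten-year window from 2015; both are assumptions for illustration, since the text does not specify a functional form or an exact end date.

```python
import math

def implied_half_life(start: float, end: float, years: float) -> float:
    """Half-life (in years) of a quantity decaying exponentially from start to end."""
    if not (start > end > 0 and years > 0):
        raise ValueError("expected start > end > 0 and years > 0")
    decay_rate = math.log(start / end) / years  # continuous decay rate per year
    return math.log(2) / decay_rate

# Illustrative figures: a 0.5 Sharpe contribution in 2015, 0.1 roughly ten years later.
half_life = implied_half_life(0.5, 0.1, years=10.0)
print(f"implied half-life: {half_life:.1f} years")
```

Under those assumptions the contribution halves a little more often than every four and a half years, which gives a sense of how quickly a vendor's published alpha figure goes stale.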
The over-promised applications are the visually impressive ones. Counting cars in a single shopping mall's parking lot once per week produces a noisy signal that takes years to validate. Counting cars across all US retail locations, hourly, with cloud-cover correction and seasonality detrending, produces a usable feed — but at a vendor cost of seven figures annually, and with a signal that has been heavily trafficked. The institutional discipline is to validate the alpha-after-cost-after-decay before signing the contract.
Shipping and trade-flow data
AIS (Automatic Identification System) is a maritime safety transponder system through which ships broadcast their identity, position, course, and speed in near real time, along with static details such as ship type and dimensions. Cargo and capacity are not broadcast directly and have to be inferred. The data is publicly available, but vendors clean and enrich it with port mapping, cargo classification, and historical comparables. The result is a near-real-time view of global trade flows.
Applications cluster around commodity flows (crude, LNG, dry bulk, container) and regional macroeconomic activity. A sustained drop in container traffic into US West Coast ports leads US retail import data by 2–4 weeks; an unusual concentration of LNG carriers off the European coast presaged the 2022 storage build well before the data appeared in official statistics. For commodity-trading desks and macro funds, AIS-derived signals have moved from differentiation to baseline.
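A lead of this kind can be checked with a lagged correlation scan. The sketch below is a minimal version on synthetic weekly data; the series, the three-week offset built into them, and the function names are all made up for illustration and are not drawn from any real port or import statistics.

```python
from math import sqrt

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def best_lead(leader, follower, max_lag):
    """Return (lag, corr) for the lag at which leader[t] best matches follower[t + lag]."""
    best_lag, best_corr = 0, float("-inf")
    for lag in range(max_lag + 1):
        corr = pearson(leader[:len(leader) - lag], follower[lag:])
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag, best_corr

# Synthetic weekly series: the follower repeats the leader three weeks later.
container_index = [100, 98, 103, 97, 91, 88, 94, 99, 105, 102, 96, 90, 93, 101, 107, 104]
import_data = [200, 200, 200] + [2 * v + 5 for v in container_index[:-3]]

lag_weeks, corr = best_lead(container_index, import_data, max_lag=6)
print(lag_weeks, round(corr, 3))
```

On real data the peak is far less clean, and the scan should be run out-of-sample, since picking the best lag in-sample overstates the relationship.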
Limitations are real. AIS coverage is sparse in some regions (West African coastline, parts of South-East Asia). Cargo classification from the reported ship type is imperfect: a tanker can carry crude or distillate. Vessel identifier spoofing is a recurring issue, particularly during sanctions periods. The institutional default is to use AIS as a real-time supplement to traditional trade data rather than a replacement, and to model the residual uncertainty explicitly.
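One simple way to carry the uncertainty explicitly is inverse-variance weighting of the AIS-derived estimate and the traditional figure. The sketch below is only one possible approach; it assumes both estimates are unbiased with independent errors, and every number in it is hypothetical.

```python
def combine_estimates(est_a, var_a, est_b, var_b):
    """Inverse-variance weighted blend of two estimates of the same quantity.

    Returns (blended_estimate, blended_variance). Assumes unbiased,
    independent errors, which real AIS and official data may not satisfy.
    """
    precision_a, precision_b = 1.0 / var_a, 1.0 / var_b
    weight_a = precision_a / (precision_a + precision_b)
    blended = weight_a * est_a + (1.0 - weight_a) * est_b
    blended_var = 1.0 / (precision_a + precision_b)
    return blended, blended_var

# Hypothetical inputs: a noisy real-time AIS nowcast and a tighter, older official figure.
ais_estimate, ais_var = 10.0, 4.0            # standard deviation 2.0
official_estimate, official_var = 12.0, 1.0  # standard deviation 1.0

blended, blended_var = combine_estimates(ais_estimate, ais_var,
                                         official_estimate, official_var)
print(round(blended, 2), round(blended_var, 2))
```

The noisier AIS figure receives the smaller weight, and the blended variance is lower than either input, which is the sense in which AIS supplements rather than replaces the traditional series.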
NLP and textual data
Natural-language processing on financial text is now mature enough to be a baseline rather than a differentiator. Three application classes dominate institutional use.
Sentiment scoring on news, filings, earnings transcripts, and analyst reports produces a continuous tone signal that, at the firm-day level, has a small but persistent correlation with subsequent returns. Modern transformer-based models (FinBERT and successors) score domain-specific text reliably; the remaining alpha is in coverage, latency, and entity disambiguation rather than model architecture.
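The firm-day tone signal can be sketched as a weighted average of per-document scores. In the example below the scores stand in for the output of a sentiment model, and the tickers, dates, weights, and function name are hypothetical; the weighting scheme is an assumption, not a description of any particular vendor's method.

```python
from collections import defaultdict

def firm_day_tone(docs):
    """Aggregate (ticker, day, score, weight) records into a weighted mean per firm-day."""
    weighted_sum = defaultdict(float)
    total_weight = defaultdict(float)
    for ticker, day, score, weight in docs:
        weighted_sum[(ticker, day)] += score * weight
        total_weight[(ticker, day)] += weight
    return {key: weighted_sum[key] / total_weight[key]
            for key in weighted_sum if total_weight[key] > 0}

# Placeholder document scores in [-1, 1]; weight might reflect source relevance.
docs = [
    ("ACME", "2024-03-01", 0.6, 2.0),
    ("ACME", "2024-03-01", -0.2, 1.0),
    ("GLOBEX", "2024-03-01", -0.5, 1.0),
]
tone = firm_day_tone(docs)
print(tone)
```

Most of the practical difficulty sits upstream of this step, in mapping each document to the right entity and timestamping it when it actually became available.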
Event extraction identifies discrete corporate events (M&A rumours, executive departures, product launches, regulatory actions) at far higher recall than human analysts. The signal is most useful where events drive return jumps and where the extraction is reliably faster than the broader market reaction. The relevant window is seconds, not minutes: a few minutes after the event, the market reaction is largely complete.
Topic and theme detection is the most recent frontier. Embedding-based clustering of large news corpora identifies emerging themes — supply-chain risk, AI capex, regulatory pressure — and the firms most exposed to them. The signal is medium-frequency (weeks to months), supports thematic rotation strategies, and is harder to commoditise because the themes themselves are non-stationary. Across our four strategies, NLP-derived theme exposure is one of the inputs to the signal weighting layer.
Web traffic, app data, and consumer panels
Web-traffic and app-engagement data (from SimilarWeb, App Annie, Sensor Tower, and a dozen smaller vendors) provides a near-real-time view of demand at consumer-internet companies. The data was a meaningful alpha source from roughly 2015 through 2020 for US large-cap consumer internet stocks. It has since been broadly adopted, and the residual alpha is concentrated in (a) coverage of smaller names, (b) cross-platform attribution, and (c) leading indicators of subscription churn.
Credit-card panels aggregate transactional data from millions of consumers and produce category- and merchant-level revenue nowcasts. At the merchant level, panel-derived nowcasts predict reported quarterly revenue with high R² for restaurants, retail, and direct-to-consumer brands. The remaining alpha after cost is meaningful for hedge funds running specific strategies (event-driven around earnings, long-short consumer) and minimal for diversified factor strategies.
Both data types share three pitfalls. Sample bias is the most common: a panel that over-represents one income bracket or geography produces signals that don't generalise. Coverage drift, where the panel composition changes over time, corrupts historical comparisons. Disclosure latency, the gap between a transaction and its availability in the panel, varies by vendor and matters for shorter-horizon strategies. Vendor due diligence on these three dimensions separates usable feeds from unusable ones.
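Sample bias of this kind is commonly addressed by reweighting the panel to population shares (post-stratification). The sketch below uses made-up income brackets, spend figures, and shares purely to show the size of the correction; it is not a description of how any vendor adjusts its panel.

```python
def panel_means(mean_spend, panel_share, population_share):
    """Return (raw_mean, reweighted_mean) of spend across brackets.

    raw_mean weights each bracket by its share of the panel;
    reweighted_mean weights it by its share of the target population.
    """
    raw = sum(mean_spend[b] * panel_share[b] for b in mean_spend)
    reweighted = sum(mean_spend[b] * population_share[b] for b in mean_spend)
    return raw, reweighted

# Hypothetical panel that over-represents the high-income bracket.
mean_spend = {"low": 40.0, "mid": 60.0, "high": 120.0}
panel_share = {"low": 0.2, "mid": 0.3, "high": 0.5}
population_share = {"low": 0.4, "mid": 0.4, "high": 0.2}

raw_mean, reweighted_mean = panel_means(mean_spend, panel_share, population_share)
print(round(raw_mean, 2), round(reweighted_mean, 2))
```

In this toy case the uncorrected panel overstates average spend by about a third, and a change in panel composition over time would move the raw figure even if true spending were flat.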
Common pitfalls in alt-data evaluation
The standard alt-data sales process produces an attractive backtest in a vendor-controlled environment. The institutional buyer's job is to assume that backtest is wrong until proved otherwise. Three pitfalls recur.
Survivorship and selection in the vendor's history. A vendor's three-year backtest may be drawn from a panel that excluded firms that went bankrupt during the period, or restricted to a universe selected with the benefit of hindsight. Replicating the backtest on point-in-time data, fully out-of-sample, on a universe formed independently is the minimum sanity check.
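The core of a point-in-time replication is an as-of lookup: at each backtest date, use only the values that had been published by then. The sketch below is a minimal version with hypothetical dates and values; a real pipeline would also need point-in-time universe membership and identifier mappings.

```python
from bisect import bisect_right
from datetime import date

def as_of(history, when):
    """Return the latest value published on or before `when`, or None if nothing was.

    `history` is a list of (publication_date, value) pairs sorted by publication_date.
    """
    publication_dates = [published for published, _ in history]
    index = bisect_right(publication_dates, when)
    return history[index - 1][1] if index else None

# Hypothetical vendor series, including a later revision of the February figure.
history = [
    (date(2024, 1, 15), 100.0),
    (date(2024, 2, 15), 104.0),
    (date(2024, 2, 28), 101.0),  # revised value, published two weeks later
]

print(as_of(history, date(2024, 1, 1)))   # before the first publication
print(as_of(history, date(2024, 2, 20)))  # sees the unrevised February value
print(as_of(history, date(2024, 3, 1)))   # sees the revision
```

A backtest that reads the revised value on 20 February is using information that did not exist on that date, which is exactly the leak this check is meant to expose.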
Crowding and decay. A signal sold to twenty institutional clients loses alpha as each of them puts it into production; by the time the slowest client has implemented it, little may be left. Vendors rarely disclose client lists. The proxy is to ask when the data product launched, who the early adopters were, and what has happened to the published alpha estimates since. A signal with a flat alpha curve since 2018 is more credible than one with a steeply declining curve.
Cost of integration. Alt-data feeds typically arrive in a non-standard format, with idiosyncratic quirks (timezones, identifier mappings, missing-data conventions). Integration is rarely cheaper than the headline data fee. The total cost of ownership over three years often runs 1.5x–2x the list fees for the same period. The signal must be alpha-positive after that real total cost, not after the sticker price.
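The cost arithmetic can be made concrete with a small sketch. Every figure below (fee, integration cost, upkeep, assets, gross alpha) is hypothetical, and the multiple is measured against three years of list fees, which is one reading of the 1.5x–2x range.

```python
def three_year_tco(annual_fee, integration_cost, annual_upkeep):
    """Total cost of ownership over three years: fees, one-off integration, upkeep."""
    return 3 * annual_fee + integration_cost + 3 * annual_upkeep

def annual_net_alpha(gross_alpha_bps, assets, annual_cost):
    """Annual alpha in currency terms after subtracting an annual cost."""
    return assets * gross_alpha_bps / 10_000 - annual_cost

# Hypothetical inputs for illustration only.
annual_fee = 1_000_000
integration_cost = 1_200_000   # one-off engineering and mapping work
annual_upkeep = 300_000        # monitoring, vendor changes, data fixes
assets = 2_000_000_000
gross_alpha_bps = 10

tco = three_year_tco(annual_fee, integration_cost, annual_upkeep)
tco_multiple = tco / (3 * annual_fee)
net_at_sticker = annual_net_alpha(gross_alpha_bps, assets, annual_fee)
net_at_tco = annual_net_alpha(gross_alpha_bps, assets, tco / 3)
print(tco, round(tco_multiple, 2), net_at_sticker, net_at_tco)
```

With these inputs the feed looks comfortably profitable against the sticker price and only marginally so against the full cost, before allowing for any further decay in the signal.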
Where alt-data adds value — and where it does not
The honest answer to the alt-data question is that the field has matured. The first generation of signals — broad satellite, basic NLP, vanilla credit-card panels — is largely commoditised. The second generation — fine-grained logistics, multi-modal fusion, theme detection — has alpha but is expensive and operationally heavy. The third generation, which is just emerging, focuses on rare, hard-to-replicate datasets where the operator has structural access (regulatory filings in non-English jurisdictions, energy-grid telemetry, satellite-derived climate proxies for parametric insurance).
For institutional managers, alt-data spend has the highest return when it is targeted at structural information gaps rather than incremental factor refinement. A commodity desk gains more from real-time AIS coverage of a specific trade lane than from another generic sentiment feed. A consumer hedge fund gains more from category-level credit-card panels than from another satellite parking-lot count.
Across our strategies, alt-data is treated as one input among many to the optimisation layer rather than a stand-alone alpha source. The multi-objective optimiser weights alt-data-derived subsystems alongside price-based subsystems, and the correlation discipline ensures that alt-data exposure does not concentrate the portfolio on any single information source. Alt-data is a marginal improvement; it is rarely the strategy.