The Structural Limitations of Large Language Models in Trading Research
Recent discussion and analysis surrounding the application of Artificial Intelligence, particularly Large Language Models (LLMs), in the field of trading research have ignited significant debate within the financial technology community. The author’s previous articles, "More of the Disease, Faster" and "AI Will Create Millions of Quants," explored the inherent challenges LLMs face when tasked with identifying genuine trading edges, highlighting the superficiality of AI-generated backtests and the potential for widespread overfitting. The engagement these pieces received underscores a critical disconnect: a widespread misunderstanding of what constitutes true trading research versus mere pattern matching and statistical analysis. This disconnect is not only prevalent among human practitioners but is also mirrored in the capabilities of current LLM technology, as evidenced by recent academic research. Two recent papers suggest these limitations are not transient glitches but rather fundamental architectural features of LLMs, presenting significant hurdles for their effective use in uncovering robust trading strategies.
The Problem of Training Data: A Foundation of Noise
The initial critique of LLMs in trading research centered on their training data. As these models learn from vast swathes of information scraped from the internet, they inevitably absorb the prevailing, often superficial, paradigms of online trading discourse. This includes common, but not necessarily effective, advice such as "don’t fight the trend," "use RSI for entries," "paper trade for six months," and "validate with out-of-sample data." The internet’s trading content overwhelmingly promotes the idea that backtesting and statistical analysis are synonymous with research, and that identifying historical patterns equates to discovering a profitable edge.
Crucially, the fundamental question of trading—"who pays you and why?"—is largely absent from this training corpus. Concepts like mechanism-based thinking, structural edges, and understanding participant constraints are relegated to the periphery of online trading discussions. The vast majority of accessible trading content consists of conventional wisdom presented as profound insight. Consequently, LLMs, trained on this data, lack an intrinsic understanding of what genuine research entails. They reproduce the dominant paradigm with an alarming degree of confidence, leading to the generation of plausible-sounding but ultimately hollow strategies.
While the argument for curating training data, employing Retrieval-Augmented Generation (RAG), and building specialized knowledge bases holds merit, recent research suggests these solutions address only one facet of a more complex problem. Even with perfectly curated data, two additional, architecture-level issues persist, fundamentally limiting LLMs’ capacity for sophisticated trading research.
The Forgetful Machine: Proactive Interference in LLMs
A groundbreaking paper titled "Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length" by Wang and Sun (2025) has shed light on a critical limitation: LLMs struggle to reliably track values that change over time, a phenomenon termed "proactive interference." The researchers devised a deceptively simple test: simulating the tracking of a patient’s blood pressure throughout a hospital visit. The model was presented with sequential updates—initial reading, a later measurement, and a final value at discharge. When asked for the "current" blood pressure, the LLM’s accuracy declined log-linearly as the number of preceding updates increased. This decline was observed across over 35 diverse models, including prominent ones like GPT, Claude, Llama, and Gemini, irrespective of their size or proprietary nature.
This persistent and universal pattern indicates that earlier information interferes with the model’s ability to retrieve the most recent data point. The researchers borrowed the term "proactive interference" from cognitive science to describe this phenomenon, where prior learned information impedes the recall of new, relevant information. This is particularly concerning due to several factors:
- Inadequacy of Prompt Engineering: Attempts to mitigate this by instructing the model to "forget" older values or "focus on the most recent update" yielded only marginal improvements. In some instances, explicit instructions to disregard past data even exacerbated errors by anchoring them around the point of instruction.
- Context Window Limitations: Increasing the context window, a common strategy to improve LLM performance, does not resolve this issue. A larger window merely provides more space for interference to accumulate, leading to a slower but still inevitable decline in accuracy.
- Universality Across Models: Every model tested exhibited the same trend, with larger models showing a slower rate of decline but ultimately succumbing to the same pattern of interference.
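To make the probe concrete, here is a minimal Python sketch of the update-tracking task described above. The prompt format and function names are my own illustration of the setup, not the paper's released harness; scoring an actual model would require wiring `score` to an LLM API.

```python
import random

def build_interference_probe(n_updates: int, seed: int = 0):
    """Construct a key-tracking probe in the style of Wang & Sun (2025):
    one key ('blood pressure') is updated n_updates times, then the model
    is asked for the *current* value.  The exact wording is illustrative."""
    rng = random.Random(seed)
    values = [f"{rng.randint(90, 180)}/{rng.randint(60, 110)}"
              for _ in range(n_updates)]
    lines = [f"Update {i + 1}: blood pressure is {v}."
             for i, v in enumerate(values)]
    prompt = "\n".join(lines) + "\nWhat is the patient's current blood pressure?"
    return prompt, values[-1]  # ground truth is always the final update

def score(model_answer: str, truth: str) -> bool:
    """A response counts as correct only if it reports the latest value."""
    return truth in model_answer

prompt, truth = build_interference_probe(n_updates=5)
```

Sweeping `n_updates` upward and plotting accuracy against it is what reveals the log-linear decline: every earlier update is a distractor competing with the one value that matters.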
The implications for trading research are profound. While LLMs can learn theoretical concepts like the nature of edge or portfolio construction principles from their training data, applying these concepts in dynamic market environments requires robust tracking of evolving states. Market regimes shift, correlations break down, volatility fluctuates, and participant behaviors change. An LLM that cannot reliably identify the "current blood pressure" in a controlled experiment is ill-equipped to track the complex, ever-changing variables that influence financial markets. This architectural failing means that even with perfect training data, the ability to adapt and apply knowledge to current market conditions is severely compromised.
The Artificial Hivemind: Convergence and Homogeneity in LLM Outputs
A second significant study, "Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)" by Jiang et al. (2025), presented at NeurIPS 2025, investigated the diversity of responses generated by LLMs when presented with open-ended queries. Testing over 25 models across 100 queries, with 50 responses generated per model per query, the findings revealed a striking lack of originality.
The research highlighted two key issues:
- Intra-model Repetition: A single LLM tends to produce highly similar responses to the same query, even when configured with parameters designed to maximize diversity. In 79% of queries, the pairwise similarity between responses from the same model exceeded 0.8, meaning repeated requests yield essentially identical outputs.
- Inter-model Homogeneity: Remarkably, distinct LLMs from different organizations, employing varied architectures and training data, exhibit significant convergence in their outputs. The average pairwise similarity between responses from different models ranged from 0.71 to 0.82. For instance, DeepSeek-V3 and GPT-4o showed a similarity of 0.81, suggesting a homogenization effect across ostensibly independent systems.
A compelling example from the paper illustrates this phenomenon: when 25 different LLMs were asked to "write a metaphor about time," the collective 1,250 responses predominantly converged on just two metaphors: "time is a river" and "time is a weaver." This demonstrates that even when prompted with abstract concepts, LLMs gravitate towards the most frequently represented ideas in their training data.
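The measurement behind these similarity figures can be approximated with a small sketch. The paper scores responses with embedding-based similarity; a bag-of-words cosine, used here purely as a simplified stand-in, illustrates the same idea that near-duplicate responses score close to 1.

```python
from collections import Counter
from itertools import combinations
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a)  # Counter returns 0 for missing words
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mean_pairwise_similarity(responses):
    """Average similarity over all response pairs; values near 1 indicate
    the intra-model repetition described above."""
    vecs = [Counter(r.lower().split()) for r in responses]
    pairs = list(combinations(vecs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

# Toy illustration with near-duplicate "time is a river" responses:
responses = [
    "time is a river that carries us forward",
    "time is a river carrying us forward",
    "time is a river that carries everything forward",
]
print(mean_pairwise_similarity(responses))
```

A genuinely diverse sample of metaphors would push this average toward 0; the hivemind result is that real model outputs cluster far above it.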
This "mode collapse" has direct implications for trading research. The modal output for trading content online is, as previously established, conventional wisdom. LLMs, due to mode collapse, systematically suppress the less common but potentially more insightful information found in the tail of the data distribution and amplify the prevalent, often inaccurate, advice. The convergence towards modal outputs means that LLMs not only learn from bad data but also preferentially surface that bad data. Commonplace advice like "use a stop loss," "paper trade first," or "backtest with moving averages" becomes the default, akin to the "time is a river" metaphor for abstract thought.
The author’s personal experience corroborates these findings. Despite building a RAG system on high-quality, mechanism-first content from the "TLQ Bootcamp" material and explicitly instructing the LLM to rely solely on this curated database, the model still incorporated generic advice on stop losses and paper trading, advice demonstrably absent from the RAG source. This exemplifies both proactive interference (the model cannot suppress what it absorbed in pretraining) and mode collapse (it defaults to the most common trading advice).
The Stepford Quants: A Convergence on Conventional Wisdom
The implications of this homogeneity are particularly concerning for the financial industry. The "backtest cycle of doom," where AI makes it trivially easy to generate convincing but meaningless backtests, is exacerbated by mode collapse. Instead of independent exploration of flawed strategies, millions of AI-assisted quants are likely converging on the same flawed strategies, driven by the modal output of conventional trading wisdom. These "strategies" are not genuine edges but rather the statistically probable outputs of suboptimal training data amplified by architectural convergence. The resulting landscape could be populated by a generation of "Stepford Quants"—seemingly productive and innovative, yet producing identical, conventional, and ultimately ineffective outcomes.
LLMs: Exceptional Coders, Deficient Researchers
This nuanced understanding clarifies why LLMs excel in certain domains while faltering in others. Their utility in coding, for instance, is undeniable. LLMs are extraordinary assistants for boilerplate code, data wrangling, chart generation, and testing. This effectiveness stems from mode collapse in coding, which leads to convergence towards best practices. The training data for coding—comprising repositories like GitHub and Stack Overflow—is inherently self-correcting, with good practices reinforced and bad ones downvoted. When an LLM converges on advice like "use a dictionary for O(1) lookups" or "implement error handling with try-except blocks," this convergence is beneficial and leads to efficient, robust code.
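As a concrete instance of that convergence being benign, here is a minimal sketch of the two cited practices. The symbol and price data are invented for illustration.

```python
# Two pieces of "modal" coding advice cited above, both genuinely good:
# dict lookups are O(1) hash probes (a list scan is O(n)), and
# try/except makes the missing-key failure path explicit.
prices = {f"SYM{i}": 100 + i for i in range(100_000)}
tickers = list(prices)  # the same symbols as a list, for comparison

def lookup(symbol: str) -> int:
    """Return the price for symbol, failing loudly if it is unknown."""
    try:
        return prices[symbol]  # O(1): a single hash probe
    except KeyError:
        raise ValueError(f"unknown symbol: {symbol}") from None

# Membership tests: "SYM99999" in tickers scans ~100k items,
# while "SYM99999" in prices resolves in constant time.
assert "SYM99999" in prices
```

When an LLM converges on this pattern, the mode of the training distribution and the correct answer coincide, which is exactly what does not happen in trading content.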
Conversely, mode collapse in trading research leads to convergence on conventional wisdom, which is frequently incorrect. The LLM surfaces advice like "validate with out-of-sample data" or "use cointegration tests for pairs trading" because this is the dominant narrative in online trading content. The genuinely insightful concepts, such as mechanism-based thinking and structural edges, reside in the tail of the data distribution and are suppressed by this convergence.
Similarly, proactive interference, while problematic for tracking evolving market states, has a less detrimental impact on coding tasks. Coding involves working with relatively static codebases and well-defined problem sets. However, in trading, where the entire premise involves understanding and adapting to a dynamic, evolving environment, the inability to reliably track current states is devastating.
This distinction underscores the appropriate application of LLMs in trading: they are powerful tools for implementation (coding, data manipulation) but are structurally ill-suited for discovery (trading research and idea generation).
The Bug is Not a Bug: Architectural Limitations
The limitations discussed—training data deficiencies, proactive interference, and mode collapse—represent at least three layers of constraint, with only the first being theoretically addressable through data curation. The latter two, proactive interference and mode collapse, are universal across all tested LLMs, transcending specific architectures and organizations. This suggests they are not "bugs" to be fixed in future releases but fundamental properties of current LLM design. Simply waiting for a more advanced model, a hypothetical "GPT-6," is therefore unlikely to resolve these issues without a paradigm shift in AI architecture.
These findings reinforce the "Edge Alchemy" framework, which posits that the human-driven theory of edge—understanding "who pays you and why"—is the indispensable element that LLMs cannot replicate. This is due to the training data’s inherent bias, the architecture’s inability to track dynamic states, and the model’s tendency to converge on prevalent, often incorrect, information.
A more promising workflow emerges: human intelligence drives the generation of insights into the theory of edge, and AI assists in the implementation of these ideas through coding and data wrangling. The ultimate evaluation and refinement of trading strategies remain firmly within the human domain.
The Dystopia of Contentment: A Warning Against Complacency
The allure of AI-assisted trading research lies in its perceived efficiency and productivity. The ability to generate strategies, run backtests, and produce code within minutes can create a compelling illusion of progress. However, this frictionless experience can lead to a form of technological complacency, akin to the societal contentment depicted in Aldous Huxley’s Brave New World. In this dystopian scenario, individuals are not oppressed but are rather perfectly content, never questioning their reality because the system feels intrinsically good.
Similarly, in AI-assisted trading, the ease with which plausible strategies are generated can mask the underlying reality: a convergence towards conventional wisdom, all while the user feels they are engaged in cutting-edge work. The LLM’s confidence and speed can dissuade critical inquiry, leading individuals to accept AI-generated outputs without verifying their fundamental validity.
AI is an unparalleled research assistant, augmenting human capabilities in numerous ways. However, its utility in trading research is confined to the implementation phase. The core task of understanding why an edge exists and whether it will persist remains a human endeavor. This requires depth acquired through reading, discourse, and practical experience—a process for which there is no AI shortcut. While the recent research papers provide empirical validation for these observations, the fundamental understanding of edge discovery remains a testament to human insight and rigorous analysis, a domain where AI, in its current form, cannot substitute for the human intellect.
References:
Wang, C. & Sun, J.V. (2025). "Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length." arXiv:2506.08184v3.
Jiang, L. et al. (2025). "Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)." 39th Conference on Neural Information Processing Systems (NeurIPS 2025).