On November 3, 2025, the Alpha Arena AI Trading Competition officially wrapped up its first season, with Qwen 3 Max claiming first place. The event's organizer, Nof1.ai's founder, announced the results on X (formerly Twitter), congratulating Qwen's team on their outstanding performance in the world's first large-scale AI live trading challenge.

The Alpha Arena competition brought together six cutting-edge large language models (LLMs) — including Qwen 3 Max, DeepSeek, GPT-5, Gemini 2.5 Pro, Claude 4.5 Sonnet, and Grok 4 — to test their trading capabilities in real-world financial markets. Each AI system started with $10,000 in capital and autonomously executed cryptocurrency perpetual contract trades on the decentralized exchange Hyperliquid, with no human intervention allowed.
This event marked a pivotal moment in AI-driven trading, offering valuable insights into how different large models handle risk management, market volatility, and automated decision-making under live market conditions.
Competition Background & Format
The Alpha Arena event, organised by Nof1.ai, represents the first global experiment to place top-tier AI models into live market conditions. From 18 October to 3 November 2025, the six participants traded crypto perpetual contracts on the decentralised exchange Hyperliquid. All models started with identical data feeds, account initialization, and access conditions — no human intervention was permitted. The stated objective: maximise risk-adjusted returns.
The models comprised Qwen 3 MAX (Alibaba), DeepSeek Chat V3.1, GPT-5 (OpenAI), Gemini 2.5 Pro (Google/DeepMind), Grok 4 (xAI) and Claude Sonnet 4.5 (Anthropic).
Final Results — A Stark East-West Divide
A clear regional divide emerged in the results: the Chinese models dominated the top positions, while the U.S.-based models all finished with significant losses.
Top performers
- Qwen 3 MAX: +22.3% return (~43 trades; win rate ~30.2%)
- DeepSeek Chat V3.1: +4.89% return (~41 trades; win rate ~24.4%)
 
Laggards
- Claude Sonnet 4.5: -30.81%
- Grok 4: -45.3%
- Gemini 2.5 Pro: -56.71%
- GPT-5: -62.66%
 
Notably, DeepSeek at one point achieved a peak return of +125% mid-competition, but this was followed by a sharp draw-down to its final figure.

Winning Strategies – Discipline & Trade Execution
Qwen 3 MAX: The Discipline-Driven Trader
Qwen’s success stemmed primarily from disciplined execution and a well-defined strategy. Over the 17-day contest, it executed only 43 trades (averaging fewer than three trades per day), the lowest among all participants. This low-frequency approach not only reduced transaction costs, but also signalled the model acted only when high-confidence entry points emerged.
Financial-model analysis suggests Qwen leaned heavily on classic technical indicators such as MACD and RSI, combined with strict stop-loss and take-profit rules. It treated each trade akin to an algorithmic execution: signal triggers → open position → hit target or stop-loss → exit. No hesitation.
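That mechanical loop can be sketched in a few lines of Python. The indicator thresholds, stop/target percentages, and function names below are hypothetical illustrations of such a rule-based flow, not Qwen's actual rules:

```python
# Hypothetical sketch of a disciplined signal -> entry -> exit flow.
# All thresholds are illustrative, not Qwen's actual parameters.

def should_enter(rsi: float, macd_histogram: float) -> bool:
    """Enter only when both indicators agree (a high-confidence signal)."""
    return rsi < 30 and macd_histogram > 0  # oversold + bullish momentum

def manage_position(entry: float, price: float,
                    stop_pct: float = 0.02, target_pct: float = 0.04) -> str:
    """Exit mechanically at the stop-loss or take-profit level; otherwise hold."""
    if price <= entry * (1 - stop_pct):
        return "stop_loss"
    if price >= entry * (1 + target_pct):
        return "take_profit"
    return "hold"
```

The point of the sketch is that every branch is decided in advance: once a position is open, the only question the model asks is which pre-set exit level has been hit.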
DeepSeek Chat V3.1: The Quantitative Specialist
DeepSeek behaved more like a quantitative asset manager than a conversational AI. It maintained average holding periods of approximately 35 hours, and 92% of its positions were long. Its Sharpe ratio (a measure of risk-adjusted return) was reported as ~0.359 — the best among participants — indicating superior control of volatility relative to return.
Its strategy: fewer but higher-conviction trades, moderate leverage, and diversification across six major crypto assets.
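For readers unfamiliar with the metric, the Sharpe ratio is the mean excess return divided by the volatility of those returns. A minimal sketch using Python's standard library, with illustrative sample returns rather than DeepSeek's actual trade history:

```python
import statistics

def sharpe_ratio(returns: list[float], risk_free: float = 0.0) -> float:
    """Mean excess return divided by the standard deviation of excess returns."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)

# Illustrative per-period returns (not DeepSeek's real data)
print(round(sharpe_ratio([0.02, -0.01, 0.03, 0.00]), 3))  # -> 0.548
```

Two return streams with the same average profit can have very different Sharpe ratios; the denominator is what rewards DeepSeek-style volatility control.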
Losing Strategies – What Went Wrong?
Gemini 2.5 Pro: The Over-Traded, High-Cost Operator
Gemini’s downfall stemmed from excessively high trading frequency and leverage exposure. Its 238 trades (~13 per day) incurred transaction costs of ~$1,331 — over 13% of starting capital — in fees alone. The model continuously entered and exited positions in response to minor market fluctuations, reflecting a lack of conviction rather than disciplined strategy.
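The fee drag can be checked with back-of-envelope arithmetic from the figures above:

```python
starting_capital = 10_000.0
total_fees = 1_331.0    # reported transaction costs over the contest
num_trades = 238

fee_drag = total_fees / starting_capital    # fraction of capital consumed by fees
avg_fee_per_trade = total_fees / num_trades

print(f"Fee drag: {fee_drag:.1%}")                         # prints "Fee drag: 13.3%"
print(f"Average fee per trade: ${avg_fee_per_trade:.2f}")  # ~$5.59 per trade
```

In other words, Gemini had to earn roughly 13% just to break even, before any market risk was taken.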
Grok 4: The Emotion-Driven FOMO Trader
Grok aimed to exploit social-media sentiment (e.g., from X/Twitter) but ended up as the worst kind of reactive trader: in full buy-mode during peak fear-of-missing-out (FOMO) rallies, and unwinding into the depths of market pullbacks. Rather than neutralising sentiment, it became symptomatic of it.
Claude Sonnet 4.5: The Unhedged Single-Directional Long Bias
Anthropic’s Claude model carried 100% long positions throughout the contest and did not implement hedging or dynamic stop-loss mechanisms. When the market reversed mid-contest, this rigid bias turned into an exposed vulnerability.
GPT-5: The Paralysed Scholar
OpenAI’s GPT-5, despite its status as a general-purpose “jack of all trades”, under-performed spectacularly. Paradoxically, its greatest strengths as a conversational model (extensive reasoning, safety layers, error-avoidance) became its liability in trading: it hesitated. Faced with conflicting bullish and bearish signals, the model deferred decision-making rather than acting decisively. In trading, as one financial expert put it, “knowing” is not the same as doing under uncertainty.
Key Takeaways for the Finance Industry
From “Knowing” to “Understanding”
The Alpha Arena experiment exposes a fundamental gap: an AI model may know all the financial-theory definitions (e.g., Sharpe ratio, maximum draw-down, Value at Risk) but still fail when faced with real-time market dynamics, noise and feedback loops. In static academic tests, many models perform well; in live markets, the absence of a fixed “correct answer” penalises indecision.
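Maximum draw-down, for instance, is trivial to define but only bites when computed over a live equity curve. The sketch below uses an equity path that mirrors DeepSeek's reported round trip from a +125% peak to a +4.89% finish:

```python
def max_drawdown(equity: list[float]) -> float:
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak, worst = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst

# $10,000 start -> +125% peak -> +4.89% finish (DeepSeek's reported arc)
print(f"{max_drawdown([10_000, 22_500, 10_489]):.1%}")  # prints "53.4%"
```

A model can recite this definition perfectly and still, like DeepSeek, give back more than half of its peak equity in live conditions.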
Generalists vs. Specialists in Trading
Western “generalist” LLMs (designed for broad tasks) under-performed in this contest. By contrast, models with training and architecture more aligned to quantitative trading and real-time decision-making gained the edge. In trading environments, specialist design, fit-for-purpose optimisation and domain-specific training appear to trump general intelligence.
Discipline > Prediction
The victory of Qwen and the strong showing of DeepSeek illustrate that in trading, strategy execution discipline, risk-control and exposure management matter more than raw prediction accuracy. In effect: survive today, profit tomorrow.
What This Means for Institutions & Individual Investors
For Financial Institutions
Institutions considering deployment of AI trading systems should:
- Prioritise models explicitly trained in financial markets, real-time data streams and decision chains rather than off-the-shelf general-purpose LLMs.
- Ensure robust risk-management frameworks (stop-loss, position-sizing, maximum draw-down limits) are embedded.
- Validate that their model’s training data, architecture, and decision logic align with the actual trading environment (market micro-structure, regime shifts, liquidity events).
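Two of those guard-rails, risk-based position sizing and a hard draw-down limit, can be sketched as follows; the thresholds are illustrative defaults, not recommendations:

```python
def position_size(capital: float, risk_per_trade: float,
                  stop_distance_pct: float) -> float:
    """Notional size such that a stop-out loses only `risk_per_trade` of capital."""
    return (capital * risk_per_trade) / stop_distance_pct

def breaches_drawdown_limit(equity: float, peak_equity: float,
                            max_dd: float = 0.20) -> bool:
    """Halt trading once equity falls `max_dd` or more below its peak."""
    return (peak_equity - equity) / peak_equity >= max_dd

# Risking 1% of a $10,000 account with a 2% stop -> $5,000 notional position
print(position_size(10_000, 0.01, 0.02))  # prints 5000.0
```

The key design point is that both checks run outside the model: even a hesitant or over-confident LLM cannot exceed the sizes or losses the framework permits.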
 
For Individual Investors
For retail or semi-professional investors, this competition serves more as a warning than an invitation. AI trading is not a shortcut to “set-it-and-forget-it” profits. The real value lies in using AI tools for market insight, signal extraction and strategy evaluation, not blindly following “auto-trade” claims. Understanding the strategy logic, model assumptions and risk exposure remains imperative.
This is where tools like iWeaver can make a real difference. As an AI-powered personal efficiency assistant, iWeaver aggregates multi-source data, tracks market sentiment, and identifies key confidence shifts—empowering users to detect market turning points and maintain rational judgment in volatile conditions.
Although Qwen 3 MAX and DeepSeek secured the top positions this season, that doesn’t guarantee long-term dominance. Organisers have indicated that in the next iteration (Season 1.5), the rules will be adjusted, and multiple prompts and model variants will be tested in parallel to stress-test AI trading systems further. The upcoming season may be the real “awakening moment” for AI in trading.