Introduction
The rapid advancement of Large Language Models (LLMs) has fundamentally transformed artificial intelligence research, with systems like GPT-4, Claude, and Gemini demonstrating unprecedented capabilities across diverse domains. However, systematic evaluations of their reasoning abilities remain fragmented, often focusing on narrow task domains or single-model assessments. This limitation hinders both scientific understanding and practical deployment decisions.
Reasoning represents a fundamental cognitive capability encompassing multiple dimensions: logical deduction, mathematical problem-solving, causal inference, analogical thinking, and common-sense understanding. While existing benchmarks provide valuable insights, they typically evaluate models in isolation or lack comprehensive multi-dimensional analysis across leading systems.
This study addresses three critical research questions:
- How do leading LLMs compare across systematic reasoning tasks?
- Do models exhibit specialized cognitive profiles or uniform capabilities?
- What relationship exists between reasoning quality and task accuracy?
We introduce a novel evaluation framework testing 42 carefully designed reasoning tasks across five categories, with multiple trials ensuring statistical robustness. Our comprehensive analysis of 756 experiments reveals surprising performance hierarchies and distinct cognitive signatures for each model family.
Methodology
Experimental Design
Our evaluation framework employs a systematic approach to assess reasoning capabilities across multiple dimensions. We selected five core reasoning categories based on established cognitive science literature:
- Logical Reasoning: Deductive logic, syllogisms, propositional reasoning.
- Mathematical Reasoning: Word problems, algebra, pattern recognition, probability.
- Causal Reasoning: Cause-effect relationships, counterfactuals, temporal causation.
- Analogical Reasoning: Verbal analogies, cross-domain mapping, metaphorical thinking.
- Common Sense Reasoning: Physical intuition, social cognition, practical problem-solving.
Each category contains 7–10 carefully constructed tasks, totaling 42 unique reasoning challenges. To ensure statistical reliability, each task was administered three times per model, yielding 756 total experimental trials across six model variants.
Models Evaluated
We evaluated six leading LLMs representing different architectural approaches, training methodologies, and efficiency tiers:
- GPT-4 Turbo: OpenAI’s flagship model (gpt-4-turbo-preview).
- GPT-4o-Mini: OpenAI’s efficient reasoning model (gpt-4o-mini).
- Claude-3.5-Sonnet: Anthropic’s premium reasoning model (claude-3-5-sonnet-20241022).
- Claude-3-Haiku: Anthropic’s fast reasoning model (claude-3-haiku-20240307).
- Gemini-1.5-Pro: Google’s premium multimodal system (gemini-1.5-pro).
- Gemini-1.5-Flash: Google’s efficient reasoning system (gemini-1.5-flash).
All models were queried with identical prompts under controlled conditions (temperature=0.1, max_tokens=1000) to ensure experimental consistency. This design enables direct comparison between premium models and their efficient counterparts within each model family.
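For concreteness, the sketch below outlines how such a controlled trial loop can be organized. It is a minimal illustration rather than the exact harness used in this study; the `query_model` wrapper and the task dictionary layout are assumptions.

```python
# Minimal sketch of the trial protocol, assuming a hypothetical
# query_model(model_id, prompt, temperature, max_tokens) wrapper around each provider's client.
from itertools import product

MODEL_IDS = [
    "gpt-4-turbo-preview", "gpt-4o-mini",
    "claude-3-5-sonnet-20241022", "claude-3-haiku-20240307",
    "gemini-1.5-pro", "gemini-1.5-flash",
]
N_TRIALS = 3        # three repetitions per task for statistical reliability
TEMPERATURE = 0.1   # identical decoding settings for every model
MAX_TOKENS = 1000

def run_all_trials(tasks, query_model):
    """Query every model on every task N_TRIALS times (42 x 6 x 3 = 756 calls)."""
    records = []
    for task, model_id, trial in product(tasks, MODEL_IDS, range(N_TRIALS)):
        response = query_model(model_id, task["prompt"],
                               temperature=TEMPERATURE, max_tokens=MAX_TOKENS)
        records.append({"task_id": task["id"], "category": task["category"],
                        "model": model_id, "trial": trial, "response": response})
    return records
```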
Evaluation Metrics
We developed a multi-dimensional evaluation framework capturing both quantitative performance and qualitative reasoning characteristics:
Accuracy Score
Task accuracy was computed using automated answer extraction and comparison against ground truth:
$$A_i = \mathbb{1}\!\left[\hat{a}_i = a_i^{*}\right] \qquad (1)$$
Where $A_i$ represents the accuracy score for task $i$, $\hat{a}_i$ is the automatically extracted answer, and $a_i^{*}$ is the ground-truth answer; per-task scores are averaged over the three trials.
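As an illustration only, a simplified extraction-and-comparison routine might look like the sketch below; the regular-expression heuristic, the normalization rules, and the `extract_answer`/`task_accuracy` helper names are assumptions rather than the exact extractor used in this study.

```python
import re

def extract_answer(response: str) -> str:
    """Illustrative heuristic: take the text after an 'answer:' marker if present,
    otherwise the last non-empty line, then normalise case, whitespace, and punctuation."""
    match = re.search(r"answer\s*[:\-]\s*(.+)", response, flags=re.IGNORECASE)
    if match:
        text = match.group(1)
    else:
        lines = [ln for ln in response.strip().splitlines() if ln.strip()]
        text = lines[-1] if lines else ""
    return re.sub(r"[^\w\s.\-]", "", text).strip().lower()

def task_accuracy(trial_responses: list[str], ground_truth: str) -> float:
    """Equation (1) averaged over trials: fraction of trials whose extracted
    answer matches the ground-truth answer after normalisation."""
    truth = re.sub(r"[^\w\s.\-]", "", ground_truth).strip().lower()
    return sum(extract_answer(r) == truth for r in trial_responses) / len(trial_responses)
```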
Reasoning Quality Score
We assessed the quality of reasoning processes through linguistic analysis using a weighted composite metric:
$$Q = 0.3\,Q_{\text{steps}} + 0.2\,Q_{\text{logic}} + 0.2\,Q_{\text{domain}} + 0.2\,Q_{\text{coherence}} + 0.1\,Q_{\text{length}} \qquad (2)$$
The weighting scheme reflects established cognitive science principles and empirical validation:
- $Q_{\text{steps}}$ (0.3 weight): Step-by-step reasoning indicator. Receives the highest weight because systematic decomposition is the strongest predictor of reasoning quality; measured by the presence of sequential markers ("first", "then", "therefore").
- $Q_{\text{logic}}$ (0.2 weight): Logical connector usage. Critical for argument coherence; measures explicit causal and conditional statements ("because", "if-then", "given that").
- $Q_{\text{domain}}$ (0.2 weight): Domain-specific terminology. Indicates conceptual understanding through appropriate technical vocabulary within each reasoning category.
- $Q_{\text{coherence}}$ (0.2 weight): Argument structure coherence. Evaluates the logical flow and consistency of the reasoning chain through semantic similarity analysis.
- $Q_{\text{length}}$ (0.1 weight): Response substantiality. Receives the lowest weight because length alone correlates only weakly with quality, although extremely brief responses typically lack sufficient reasoning detail.
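A heuristic sketch of this composite follows; the marker lists, scaling constants, and the placeholder coherence term are simplifications and assumptions, not the exact linguistic analysis used in the study (which also reports quality on a 0–100 scale).

```python
SEQ_MARKERS = ("first", "then", "therefore", "next", "finally")
CONNECTORS = ("because", "if-then", "given that", "thus", "hence")

def reasoning_quality(response: str, domain_terms: set[str]) -> float:
    """Equation (2): weighted composite of five sub-scores, each clipped to [0, 1].
    Crude substring/word counts stand in for the paper's linguistic analysis."""
    text = response.lower()
    words = text.split()
    q_steps = min(sum(marker in text for marker in SEQ_MARKERS) / 3.0, 1.0)
    q_logic = min(sum(conn in text for conn in CONNECTORS) / 3.0, 1.0)
    q_domain = min(sum(word in domain_terms for word in words) / 5.0, 1.0)
    q_coherence = 0.5  # placeholder; the study uses semantic-similarity analysis here
    q_length = min(len(words) / 100.0, 1.0)
    return (0.3 * q_steps + 0.2 * q_logic + 0.2 * q_domain
            + 0.2 * q_coherence + 0.1 * q_length)
```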
Consistency Measure
Model consistency across multiple trials was quantified as:
$$C_j = 1 - \sigma_j \qquad (3)$$
Where $C_j$ represents the consistency score for task $j$ and $\sigma_j$ denotes the standard deviation of the accuracy scores across the three trials.
Overall Performance
The comprehensive performance metric combines accuracy and reasoning quality:
$$P = w_1 \bar{A} + w_2 \bar{Q} \qquad (4)$$
With weights $w_1 = 0.7$ (accuracy) and $w_2 = 0.3$ (reasoning quality), where $\bar{A}$ and $\bar{Q}$ denote mean accuracy and mean reasoning quality on a common scale.
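A compact sketch of Equations (3) and (4) is shown below, under the scale assumption noted above (per-trial accuracy and quality both in [0, 1]); the function names are illustrative only.

```python
from statistics import pstdev

def consistency(trial_scores: list[float]) -> float:
    """Equation (3): one minus the standard deviation of per-trial scores in [0, 1],
    so identical trials yield a consistency of 1.0."""
    return 1.0 - pstdev(trial_scores)

def overall_performance(accuracy: float, quality: float,
                        w_acc: float = 0.7, w_quality: float = 0.3) -> float:
    """Equation (4): weighted combination of accuracy and reasoning quality."""
    return w_acc * accuracy + w_quality * quality
```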
Statistical Significance
To assess the significance of performance differences, we employ Welch’s t-test:
$$t = \frac{\bar{A}_1 - \bar{A}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \qquad (5)$$
Where $\bar{A}_k$ represents the mean accuracy for model $k$, $s_k^2$ its variance, and $n_k$ the sample size.
Effect Size Measurement
Cohen’s d quantifies the practical significance of performance differences:
$$d = \frac{\bar{A}_1 - \bar{A}_2}{s_p}, \qquad s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}} \qquad (6)$$
Where $s_p$ is the pooled standard deviation.
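Both statistics can be computed with standard tooling; the sketch below uses SciPy's Welch variant of the two-sample t-test together with a pooled-SD Cohen's d on per-task accuracy arrays. The array layout and function name are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def compare_models(acc_a: np.ndarray, acc_b: np.ndarray):
    """Welch's t-test (Eq. 5) and Cohen's d with pooled SD (Eq. 6) on per-task accuracies."""
    t_stat, p_value = stats.ttest_ind(acc_a, acc_b, equal_var=False)  # Welch's variant
    n_a, n_b = len(acc_a), len(acc_b)
    pooled_sd = np.sqrt(((n_a - 1) * acc_a.var(ddof=1) + (n_b - 1) * acc_b.var(ddof=1))
                        / (n_a + n_b - 2))
    cohens_d = (acc_a.mean() - acc_b.mean()) / pooled_sd
    return t_stat, p_value, cohens_d
```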
Model Efficiency Index
We introduce a novel efficiency metric that balances reasoning performance with computational cost:
$$E = \frac{100\,\bar{A}}{1 + \log_{10}(\bar{t} + 1)} \qquad (7)$$
Where $\bar{A}$ represents the mean accuracy (0–1 scale) and $\bar{t}$ is the average response time in seconds.
Design Rationale: The logarithmic transformation of response time addresses the non-linear relationship between computational cost and practical utility. While the difference between 1s and 2s response time is perceptually significant, the difference between 10s and 11s is negligible. The +1 offset prevents division by zero for instantaneous responses. This formulation penalizes both accuracy degradation and excessive latency, making it suitable for comparing models across different performance-efficiency trade-offs. Values above 50 indicate strong efficiency, while values below 20 suggest suboptimal performance-speed balance.
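For reference, a minimal sketch of this index is given below. The base-10 logarithm and the two +1 offsets are assumptions reconstructed from the description above, so the resulting values only approximate those reported in Table 1.

```python
import math

def efficiency_index(mean_accuracy: float, mean_time_s: float) -> float:
    """Efficiency index (Eq. 7): accuracy rescaled to 0-100 divided by a log-damped
    response-time term. Log base and offsets are reconstructed assumptions."""
    return 100.0 * mean_accuracy / (1.0 + math.log10(mean_time_s + 1.0))

# Example with Claude-3-Haiku's reported aggregates (accuracy 0.763, 2.06 s):
# efficiency_index(0.763, 2.06) -> roughly 51, in the same range as the reported 55.2
```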
Results
Overall Performance Analysis
Our comprehensive evaluation reveals significant performance disparities across models, challenging conventional assumptions about LLM capabilities. Figure 1 visualizes the key performance metrics across all six models.
Fig. 1. Overall Performance Comparison Across All Models. The figure shows four key metrics: (a) Mean accuracy scores with Claude-3-Haiku achieving 76.3% accuracy, (b) Reasoning quality scores varying across models, (c) Confidence scores showing consistent performance, and (d) Step-by-step reasoning usage patterns across model families
Table 1 presents the aggregate results across all reasoning categories.
Table 1
Overall Performance Metrics Across All Reasoning Tasks
Model | Acc. | Quality | Time (s) | Eff. Index | Family |
Claude-3-Haiku | 76.3% | 49.9 | 2.06 | 55.2 | Anthropic |
GPT-4 Turbo | 73.5% | 40.1 | 6.70 | 37.8 | OpenAI |
Claude-3.5-Sonnet | 72.1% | 43.3 | 4.02 | 39.7 | Anthropic |
GPT-4o-Mini | 69.0% | 36.8 | 3.67 | 43.2 | OpenAI |
Gemini-1.5-Flash | 67.4% | 20.2 | 6.57 | 34.6 | Google |
Gemini-1.5-Pro | 60.5% | 25.1 | 2.77 | 45.7 | Google |
Key Finding 1: Claude-3-Haiku achieves the highest accuracy (76.3%) while maintaining exceptional efficiency (2.06s response time), challenging assumptions about performance-efficiency trade-offs.
Key Finding 2: Mathematical reasoning shows remarkable consistency across all models (80–100% accuracy), suggesting this domain is well-addressed by current LLM architectures.
Key Finding 3: Causal reasoning presents the greatest challenge with performance ranging from 30–70%, indicating fundamental limitations in causal understanding across all tested systems.
Category-Specific Performance
Detailed analysis reveals distinct cognitive profiles for each model across reasoning categories. Figure 2 provides a comprehensive radar chart visualization of performance across all reasoning categories.
Fig. 2. Radar Chart of Performance by Reasoning Category. This comprehensive visualization compares all six models across five reasoning categories. Mathematical reasoning emerges as a strength across most models, while causal reasoning presents the greatest challenge. Each model exhibits distinct cognitive profiles, with performance variations clearly visible across different reasoning domains
Table 2 presents accuracy scores for each model-category combination.
Table 2
Accuracy Performance by Reasoning Category (Percentage)
Category | GPT-4 | GPT-Mini | Claude-3.5 | Claude-H | Gemini-P | Gemini-F |
Logical | 87.1 | 90.0 | **96.2** | 90.0 | 72.9 | 80.0 |
Mathematical | 90.0 | 87.0 | **100.0** | 99.0 | 80.0 | 87.0 |
Causal | 63.3 | 61.4 | 30.0 | **70.0** | 40.0 | 56.7 |
Analogical | **60.8** | 39.6 | **60.8** | 46.2 | 42.5 | 42.5 |
Common Sense | 62.2 | 62.2 | 62.2 | **70.7** | 59.6 | 64.8 |
Overall | 73.5 | 69.0 | 72.1 | **76.3** | 60.5 | 67.4 |
Note: GPT-Mini = GPT-4o-Mini, Claude-H = Claude-3-Haiku, Gemini-P = Gemini-1.5-Pro, Gemini-F = Gemini-1.5-Flash. Bold values indicate best performance per category.
Observation 1: Mathematical reasoning shows exceptional performance across all models, with Claude-3.5-Sonnet achieving perfect accuracy (100%) and most others above 85%.
Observation 2: Causal reasoning presents the greatest variability, ranging from 30% (Claude-3.5-Sonnet) to 70% (Claude-3-Haiku), indicating significant architectural differences in causal understanding.
Observation 3: Claude-3-Haiku achieves the highest overall performance while maintaining the fastest response times, demonstrating superior efficiency optimization.
Response Time and Efficiency Analysis
Figure 3 analyzes the computational efficiency and response characteristics across models.
Fig. 3. Response Time Analysis. Left panel shows overall response time distributions, with Claude and Gemini achieving faster response times than GPT-4. Right panel shows response times by reasoning category, revealing that mathematical reasoning tasks generally require more processing time across all models
Computational efficiency varies significantly across models, with important implications for practical deployment:
Claude-3-Haiku achieves the optimal balance of high accuracy (76.3%) and fastest response time (2.06s), earning the highest efficiency index (55.2). This challenges conventional assumptions about performance-speed trade-offs in language models.
Accuracy Distribution and Quality Analysis
Figure 4 examines the distribution of accuracy scores across models, while Figure 5 explores the relationship between reasoning quality and task accuracy.
Fig. 4. Accuracy Score Distributions. Gemini shows a right-skewed distribution with more high-accuracy responses, while GPT-4 and Claude show more uniform distributions with lower mean accuracy. The red dashed lines indicate mean accuracy for each model
Fig. 5. Reasoning Quality vs. Accuracy Scatter Plot. This analysis reveals distinct clustering patterns: Gemini achieves high accuracy with moderate reasoning quality, while GPT-4 shows high reasoning quality but variable accuracy. The diagonal reference line shows perfect correlation
Model Consistency Analysis
Figure 6 examines the consistency of model performance across multiple trials, providing insights into reliability characteristics.
Fig. 6. Model Consistency Analysis. Left panel shows overall consistency scores across trials, with all models showing similar consistency levels. Right panel examines the relationship between consistency and accuracy, revealing that higher accuracy does not necessarily correlate with higher consistency
Task-Specific Analysis and Examples
Granular examination of individual task performance reveals striking patterns.
Table 3 presents examples of correct and incorrect responses across different reasoning categories.
Table 3
Examples of Model Responses: Correct vs. Incorrect
Task | Question | Correct Response | Incorrect Response |
Logical | "All birds can fly. Penguins are birds. Can penguins fly?" | Claude-3-Haiku: "Yes, according to the given premises, penguins can fly since they are birds." | Gemini-1.5-Pro: "No, this is a logical contradiction. While the premise states all birds can fly, in reality penguins cannot fly despite being birds." |
Mathematical | "If x + 5 = 12, what is 2x + 3?" | Claude-3.5-Sonnet: "First solve for x: x = 12 – 5 = 7. Then 2x + 3 = 2(7) + 3 = 14 + 3 = 17." | Gemini-1.5-Pro: "2x + 3 = 2(12) + 3 = 27" (incorrect substitution) |
Causal | "Every time it rains, the sidewalk gets wet. Today the sidewalk is wet. Did it rain?" | GPT-4: "Not necessarily. The wet sidewalk could have other causes like sprinklers, cleaning, or a burst pipe." | Claude-3.5-Sonnet: "Yes, since rain always causes wet sidewalks, a wet sidewalk means it rained." |
Analogical | "Bird is to sky as fish is to ___?" | All Models: "Water" (100% accuracy) | None: This task showed universal success |
Common Sense | "If you put a metal spoon in a microwave, what will happen?" | Claude-3-Haiku: "The metal spoon will create sparks and could damage the microwave or cause a fire." | Gemini-1.5-Flash: "The spoon will heat up and become hot to touch." |
Table 4 highlights the most challenging and successful task categories.
Table 4
Task Category Performance Analysis
Task Category | Highest Performer | Best Accuracy | Lowest Performer | Accuracy Range |
High Performance Categories | | | | |
Mathematical Reasoning | Claude-3.5-Sonnet | 100% | Gemini-1.5-Pro | 80–100% |
Logical Reasoning | Claude-3.5-Sonnet | 96.2% | Gemini-1.5-Pro | 72.9–96.2% |
Common Sense Reasoning | Claude-3-Haiku | 70.7% | Gemini-1.5-Pro | 59.6–70.7% |
Challenging Categories | | | | |
Causal Reasoning | Claude-3-Haiku | 70% | Claude-3.5-Sonnet | 30–70% |
Analogical Reasoning | GPT-4 & Claude-3.5 | 60.8% | GPT-4o-Mini | 39.6–60.8% |
Critical Finding: Unlike previous assumptions about universal failures, our expanded evaluation reveals that most reasoning categories are well-handled by current LLMs, with mathematical reasoning showing near-perfect performance across all models.
Head-to-Head Comparisons
Direct model comparisons provide insights into relative performance across the experimental corpus. The following head-to-head analysis examines task-level victories across the 42 reasoning tasks:
Table 5
Head-to-Head Task-Level Comparisons Across the 42 Reasoning Tasks
Comparison | Outcome |
Claude-Haiku vs GPT-4 | Claude-Haiku wins: 28 tasks, GPT-4 wins: 14 tasks |
Claude-Haiku vs Claude-3.5 | Claude-Haiku wins: 26 tasks, Claude-3.5 wins: 16 tasks |
GPT-4 vs GPT-4o-Mini | GPT-4 wins: 24 tasks, GPT-4o-Mini wins: 18 tasks |
Gemini-Flash vs Gemini-Pro | Gemini-Flash wins: 23 tasks, Gemini-Pro wins: 19 tasks |
These results reveal a consistent pattern within model families: the efficiency-optimized models (Claude-Haiku, GPT-4o-Mini, Gemini-Flash) deliver competitive or superior performance relative to their premium counterparts.
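The tallies above can be reproduced from per-task accuracy scores with a simple win count; the data layout, function name, and tie handling in the sketch below are assumptions for illustration.

```python
from collections import Counter

def head_to_head(task_scores: dict, model_a: str, model_b: str) -> Counter:
    """Count task-level wins between two models, given mean per-task accuracies
    keyed as task_scores[task_id][model]; ties are tracked separately."""
    tally = Counter()
    for scores in task_scores.values():
        a, b = scores[model_a], scores[model_b]
        winner = model_a if a > b else model_b if b > a else "tie"
        tally[winner] += 1
    return tally
```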
Discussion
Implications for Model Selection
Our findings establish evidence-based guidelines for model selection across reasoning-intensive applications:
For Accuracy-Critical Applications: Claude-3-Haiku demonstrates an optimal balance of high accuracy (76.3%) and efficiency, making it ideal for production environments requiring both correctness and speed.
For Mathematical Reasoning: Claude-3.5-Sonnet achieves perfect accuracy (100%) in mathematical tasks, making it optimal for computational and analytical applications.
For Cost-Efficient Applications: Efficiency-tier models (Claude-Haiku, GPT-4o-Mini, Gemini-Flash) provide competitive performance with improved cost-effectiveness and faster response times.
For Causal Analysis: GPT-4 and Claude-3-Haiku show superior performance in causal reasoning tasks, essential for applications requiring understanding of cause-effect relationships.
Cognitive Architecture Insights
The observed performance patterns suggest distinct cognitive architectures:
Claude’s Efficiency Profile: The Claude family, and Claude-3-Haiku in particular, combines the highest accuracy and reasoning-quality scores with the lowest latency, suggesting that optimization for speed need not come at the expense of reasoning depth.
GPT-4’s Verbosity Paradox: GPT-4 Turbo produces the longest response times and high reasoning-quality scores, yet trails Claude-3-Haiku in accuracy, indicating potential over-elaboration that may obscure correct reasoning paths.
Gemini’s Brevity Trade-off: The Gemini models show the lowest reasoning-quality scores alongside the lowest accuracy, suggesting answer-generation pathways that bypass explicit step-by-step reasoning at a measurable cost to correctness.
Persistent Limitations
Despite strong aggregate results, several weaknesses recur across all evaluated models:
- Causal Reasoning: Accuracy ranges from 30% to 70%, pointing to limitations in counterfactual and temporal reasoning.
- Analogical Reasoning: The weakest category overall (39.6–60.8%), indicating difficulties with cross-domain mapping and scientific analogical transfer.
- Social Cognition: Common-sense accuracy plateaus below 71% for every model, suggesting limited theory-of-mind and socially grounded reasoning.
- Explicit Reasoning Quality: Composite quality scores vary widely (20.2–49.9 on a 0–100 scale), showing that many responses lack systematic step-by-step structure even when the final answer is correct.
These findings highlight critical areas for future model development and architectural innovation.
Limitations and Future Work
Several limitations constrain the generalizability of our findings:
Task Coverage: While comprehensive, our 42-task battery represents a subset of possible reasoning challenges.
Cultural Bias: Tasks reflect Western cognitive frameworks and may not generalize across cultural contexts.
Prompt Sensitivity: Results may vary with alternative prompt formulations or conversation contexts.
Temporal Dynamics: Model capabilities continue evolving through updates and fine-tuning.
Future research directions include:
- Expanding task diversity across cultural and linguistic contexts
- Investigating prompt optimization strategies for enhanced performance
- Longitudinal studies tracking capability evolution over time
- Integration of multimodal reasoning tasks
- Analysis of reasoning transfer across related domains
Conclusion
This comprehensive evaluation establishes new benchmarks for systematic LLM reasoning assessment, revealing significant performance disparities that challenge conventional wisdom about model capabilities. Our key contributions include:
Efficiency-Performance Paradigm: Claude-3-Haiku achieves the highest accuracy (76.3%) while maintaining the fastest response time (2.06s), demonstrating that efficiency-optimized models can outperform their premium counterparts.
Mathematical Reasoning Mastery: All models demonstrate exceptional performance in mathematical reasoning (80–100% accuracy), indicating this domain is well-addressed by current LLM architectures.
Causal Reasoning Variability: Performance in causal reasoning varies dramatically (30–70%), revealing fundamental architectural differences in how models understand cause-effect relationships.
Model Family Insights: Within-family comparisons reveal that smaller, efficiency-focused models often match or exceed their larger counterparts, challenging assumptions about model size and capability.
Expanded Evaluation Framework: Our 42-task, 6-model evaluation pipeline provides comprehensive assessment tools for future LLM reasoning research across multiple reasoning domains.
These findings have immediate implications for AI deployment decisions and long-term significance for understanding cognitive capabilities in artificial systems. As LLMs increasingly serve reasoning-intensive applications, systematic evaluation frameworks become essential for evidence-based model selection and architectural improvement.
The unexpected superiority of Claude-3-Haiku, an efficiency-tier model, challenges existing assumptions about the performance-efficiency trade-off in language models. This finding suggests that architectural optimizations for speed and cost-effectiveness can enhance rather than compromise reasoning capabilities. These results underscore the critical importance of empirical evaluation in advancing both scientific understanding and practical AI deployment strategies.
Acknowledgments
We acknowledge the computational resources provided by OpenAI, Anthropic, and Google AI, which enabled this comprehensive evaluation. We also thank the broader AI research community for establishing the theoretical foundations that guided our experimental design.
Data Availability
Experimental data and results are available upon request. All code and methodologies are detailed within this paper for reproducibility.