The FinBen: An Holistic Financial Benchmark for Large Language Models

Recent studies have shown that advanced LLMs such as GPT-4 hold great potential for financial text analysis and prediction tasks. While this potential is evident, their capabilities and limitations in finance remain largely unexplored.

Existing financial evaluation benchmarks, including FLUE, BBT-CFLEB, and PIXIU, have a limited scope and focus solely on financial NLP tasks, primarily targeting language understanding abilities on which LLMs have already been extensively evaluated. These benchmarks fail to capture other crucial facets of the financial domain, such as comprehending and extracting domain-specific financial knowledge and resolving realistic financial tasks. As such, their efficacy in evaluating and understanding LLM performance is limited.

To bridge this gap, the paper proposes FinBen, the first comprehensive open-sourced evaluation benchmark specifically designed to thoroughly assess the capabilities of LLMs in the financial domain. FinBen encompasses 35 datasets across 23 financial tasks, organized into three spectrums of difficulty inspired by the Cattell-Horn-Carroll theory, to evaluate LLMs’ cognitive abilities in inductive reasoning, associative memory, quantitative reasoning, crystallized intelligence, and more.

An evaluation of 15 representative LLMs, including GPT-4, ChatGPT, and the latest Gemini, reveals their strengths and limitations within the financial domain. The findings indicate that GPT-4 leads in quantification, extraction, numerical reasoning, and stock trading, while Gemini shines in generation and forecasting; however, both struggle with complex extraction and forecasting, showing a clear need for targeted enhancements. Instruction tuning boosts performance on simple tasks but falls short in improving complex reasoning and forecasting abilities.

Paper: https://arxiv.org/pdf/2402.12659.pdf

Code: https://github.com/The-FinAI/PIXIU