ProLLM Leaderboards

StackUnseen

Evaluates an LLM's ability to answer recent Stack Overflow questions, highlighting its effectiveness on new and emerging content.

#  Name          Provider  Acceptance
1  GPT-4.5       OpenAI    0.954
2  GPT-4.1 Mini  OpenAI    0.954
3  O1 Preview    OpenAI    0.938

StackEval

Evaluates an LLM's capability to function as a coding assistant by answering a variety of coding-related questions across different programming languages and question types.

#  Name          Provider  Acceptance
1  GPT-4.1       OpenAI    0.986
2  GPT-4.1 Mini  OpenAI    0.984
3  O1 Preview    OpenAI    0.981

Image Understanding

Evaluates the ability of models to interpret and understand food images from delivery applications through two distinct tasks: detailed caption generation and image quality assessment. Models must produce accurate and detailed captions describing dishes, ingredients, and presentation without hallucinating information. Additionally, they must identify relevant image-quality issues from a standardized set of labels, such as text overlays, human presence, unappealing presentation, or other similar factors.

#  Name     Provider  Caption Score  F1 Score
1  O1       OpenAI    0.886          0.323
2  GPT-4o   OpenAI    0.879          0.362
3  GPT-4.1  OpenAI    0.839          0.459
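
The F1 score above reflects how well a model's predicted quality labels match the reference labels. As a rough illustration only, here is how a micro-averaged F1 over multi-label predictions can be computed; the label names below are hypothetical, and the benchmark's exact label set and averaging method are not specified on this page.

```python
# Minimal sketch: micro-averaged F1 over multi-label quality predictions.
# Label names are illustrative, not the benchmark's actual label set.

def micro_f1(predicted: list[set[str]], gold: list[set[str]]) -> float:
    tp = fp = fn = 0
    for pred, true in zip(predicted, gold):
        tp += len(pred & true)   # labels correctly predicted
        fp += len(pred - true)   # labels predicted but not in the reference
        fn += len(true - pred)   # reference labels the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

predicted = [{"text_overlay"}, {"human_presence", "unappealing"}]
gold = [{"text_overlay", "unappealing"}, {"human_presence"}]
print(round(micro_f1(predicted, gold), 3))  # 0.667
```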

Q&A Assistant

Evaluates an LLM's effectiveness as a team member in a business environment by assessing its ability to provide accurate and contextually relevant responses. It uses diverse queries covering both technical (such as coding) and non-technical areas.

#  Name            Provider  Acceptance
1  O1 Mini         OpenAI    0.989
2  O3 Mini (High)  OpenAI    0.982
3  GPT-4.1 Mini    OpenAI    0.978

Summarization

Evaluates an LLM's ability to accurately summarize long texts from diverse sources such as YouTube video transcripts, websites, PDFs, and direct text inputs. It also assesses the model's capacity to follow detailed user instructions to extract specific data insights. The dataset consists of 41 unique entries in English, which have been translated into Afrikaans, Brazilian Portuguese, and Polish using machine translation.

#  Name          Provider  Accuracy
1  GPT-4.1       OpenAI    0.867
2  GPT-4.1 Mini  OpenAI    0.835
3  O1            OpenAI    0.823

Function Calling

Evaluates an LLM's ability to accurately use defined functions to perform specific tasks, such as web searches, code execution, and planning multiple function calls. The input is a conversation history and a list of available tools.

#  Name              Provider  Accuracy
1  GPT-4o            OpenAI    0.825
2  GPT-4.1           OpenAI    0.824
3  Gemini 2.0 Flash  Google    0.820
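
The exact input format isn't shown on this page; as a rough illustration, the sketch below assumes an OpenAI-style layout in which the model receives the conversation history plus a JSON list of tool definitions and must respond with the right function call.

```python
# Hypothetical test-case input, assuming an OpenAI-style tool schema;
# the benchmark's actual format is not shown on this page.
conversation = [
    {"role": "user", "content": "Find the latest Rust release notes."},
]

tools = [
    {
        "type": "function",
        "function": {
            "name": "web_search",  # hypothetical tool name
            "description": "Search the web and return the top results.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
]

# An accurate model would respond with a call such as:
# {"name": "web_search", "arguments": {"query": "latest Rust release notes"}}
```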

OpenBook Q&A

Evaluates an LLM's ability to answer questions using context extracted from files and provided alongside the question.

#  Name          Provider     Relevance
1  QwQ-32B       Alibaba      0.919
2  DeepSeek V3   DeepSeek AI  0.851
3  GPT-4.1 Mini  OpenAI       0.851
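
As a rough illustration of this setup, the sketch below assembles a context-grounded prompt from extracted passages; the template and wording are assumptions, not the benchmark's actual prompt.

```python
# Hypothetical prompt assembly for an open-book question; the benchmark's
# actual template is not published on this page.
def build_prompt(passages: list[str], question: str) -> str:
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(build_prompt(["The invoice is due 30 days after receipt."],
                   "When is the invoice due?"))
```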

Entity Extraction

Evaluates an LLM's ability to identify and extract specific entities from ad descriptions, given predefined definitions and potential values for each entity.

#  Name             Provider  F1 Score
1  GPT-4o           OpenAI    0.854
2  Mistral Small 3  Mistral   0.853
3  MiniMax-Text-01  MiniMax   0.851
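
A natural harness for this task is structured extraction validated against the predefined value sets. The sketch below illustrates that idea; the entity names, definitions, and values are hypothetical, not the benchmark's actual schema.

```python
# Hypothetical entity schema for an ad description; the benchmark's actual
# entities, definitions, and value sets are not listed on this page.
ENTITY_SCHEMA = {
    "condition": {
        "definition": "Physical state of the advertised item.",
        "values": ["new", "like new", "used", "for parts"],
    },
    "brand": {
        "definition": "Manufacturer of the item, if stated.",
        "values": None,  # free-form value
    },
}

def validate(extracted: dict[str, str]) -> dict[str, str]:
    """Drop entities that are not in the schema or use disallowed values."""
    valid = {}
    for name, value in extracted.items():
        spec = ENTITY_SCHEMA.get(name)
        if spec and (spec["values"] is None or value in spec["values"]):
            valid[name] = value
    return valid

print(validate({"condition": "used", "color": "red"}))  # {'condition': 'used'}
```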

SQL Disambiguation

Evaluates an LLM's ability to disambiguate user requests for generating SQL queries based on the given business rules and database schema. A question may be answerable from the schema alone, from the schema combined with the business rules, or only after additional information is provided.

#  Name     Provider  Accuracy
1  GPT-4.5  OpenAI    0.531
2  O1       OpenAI    0.492
3  Grok 2   xAI       0.479
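
To make the three categories concrete, the sketch below shows one invented example of each; the schema, business rule, and questions are illustrative only, not taken from the benchmark.

```python
# Invented examples of the three answer categories; the benchmark's actual
# schema, business rules, and questions are not shown on this page.
SCHEMA = "orders(id, customer_id, total, created_at)"
BUSINESS_RULES = "A 'large order' is any order with total >= 1000."

cases = [
    ("How many orders were placed in 2024?", "answerable_from_schema"),
    ("How many large orders were placed in 2024?", "needs_business_rules"),
    ("How many orders came from repeat visitors?", "needs_more_information"),
]

for question, expected in cases:
    print(f"{expected:>22}: {question}")
```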

LLM-as-a-Judge

Evaluates an LLM's ability to judge the acceptability of other LLM answers to given technical and non-technical questions, including some coding questions.

#  Name         Provider  Accuracy
1  GPT-4.1      OpenAI    0.853
2  GPT-4o       OpenAI    0.846
3  GPT-4 Turbo  OpenAI    0.838
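
Accuracy here is agreement between the judge's verdicts and reference acceptability labels. The sketch below shows one plausible harness; the prompt wording and the ACCEPT/REJECT labels are assumptions, not the benchmark's published setup.

```python
# Hypothetical judge prompt and scoring; the benchmark's actual prompt and
# labeling scheme are not published on this page.
JUDGE_PROMPT = (
    "You are reviewing an answer to a question.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "Reply with exactly one word: ACCEPT or REJECT."
)

def judge_accuracy(verdicts: list[str], gold: list[str]) -> float:
    """Fraction of judge verdicts that agree with the reference labels."""
    return sum(v == g for v, g in zip(verdicts, gold)) / len(gold)

print(round(judge_accuracy(["ACCEPT", "REJECT", "ACCEPT"],
                           ["ACCEPT", "ACCEPT", "ACCEPT"]), 3))  # 0.667
```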

Transcription

Evaluates transcription models on multilingual, multi-speaker audio with varying levels of background noise, across business domains such as software development, finance, classifieds, food delivery, and healthcare. The dataset consists of 150 unique audio samples, each augmented to produce a low-noise and a high-noise version.

#  Name              Provider  Accuracy
1  Whisper Large-v3  OpenAI    0.779
2  Gemini 1.5 Flash  Google    0.717
3  Gemini 1.5 Pro    Google    0.708
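
The page doesn't state how the low- and high-noise versions are generated; one common augmentation approach is to mix a noise track into the clean audio at a target signal-to-noise ratio, sketched below purely for illustration.

```python
import numpy as np

# Illustrative SNR-based noise mixing; the benchmark's actual augmentation
# procedure is not specified on this page.
def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mix has the requested signal-to-noise ratio."""
    noise = np.resize(noise, clean.shape)  # loop/trim noise to match length
    p_clean = float(np.mean(clean ** 2))
    p_noise = float(np.mean(noise ** 2))
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 200 * np.pi, 16_000))  # 1 s of a pure tone
noise = rng.normal(size=8_000)
low_noise = mix_at_snr(clean, noise, snr_db=20.0)    # "low noise" version
high_noise = mix_at_snr(clean, noise, snr_db=0.0)    # "high noise" version
```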

Have a unique use case you’d like to test?

We want to evaluate how LLMs perform on your specific, real-world task. You might discover that a small open-source model delivers the performance you need at a lower cost than proprietary models. We can also add custom filters to give you deeper insight into LLM capabilities. Each time a new model is released, we'll provide you with updated performance results.

Please briefly describe your use case and motivation. We’ll get back to you with details on how we can add your benchmark.