Description
Evaluates an LLM's ability to accurately use defined functions to perform specific tasks, such as web search, code execution, and planning multiple function calls. Each input consists of a conversation history and a list of available tools.
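For illustration, one input sample might look like the sketch below. The field names are assumptions, not the benchmark's published schema; the tool names are taken from the examples further down this page.

# Sketch of one input sample; field names are illustrative only.
sample = {
    "messages": [
        {"role": "user", "content": "How many sports have been in the Olympics?"}
    ],
    "tools": [
        {"name": "web_search", "description": "Search the web for up-to-date information."},
        {"name": "execute_python_code", "description": "Run a Python snippet and return its output."},
        {"name": "save_url_as_file", "description": "Download the contents of a URL as a file."}
    ]
}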
Provider
Prosus
Language
English
Evaluation
Multi-class classification accuracy against human-labeled data, plus auto-evaluation with GPT-4 Turbo against the ground truth.
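As an illustrative sketch (not the benchmark's actual implementation), the function-selection half of this scoring reduces to exact-match accuracy over the labeled samples:

def function_accuracy(samples):
    # Fraction of samples where the model selected the ground-truth tool.
    correct = sum(1 for s in samples if s["model_selection"] == s["correct_tool"])
    return correct / len(samples)

# The two misselection examples below would both count as misses, scoring 0.0:
samples = [
    {"correct_tool": "web_search", "model_selection": "execute_python_code"},
    {"correct_tool": "save_url_as_file", "model_selection": "web_search"},
]
print(function_accuracy(samples))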
Data Statistics
Number of Samples
788
Collection Period
January 2024 – May 2024
Function
Function types the models were tested on.
Inference Method
Approach to querying the model for function use.

[Interactive leaderboard table: columns are Model, Provider, Size, Inference Method, Function Accuracy, and Argument Correctness.]

Examples

Function Misselection: Analyzing Olympic Sports Over Time

User message:

How many sports have been in the Olympics? How has this developed over time? Can you create a chart that illustrates this change over time clearly?

Correct tool:

web_search

Model's selection:

execute_python_code

Verdict:

False
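A correct response would start by retrieving the historical data, for example with a call along these lines. The query argument is hypothetical, since the benchmark's exact web_search signature isn't shown:

# Hypothetical call for illustration only.
web_search(query="number of sports in the Summer Olympics by year")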

Function Misselection: URL Handling for Biodiversity Presentation

User message:

Suggest 3 slides about https://en.m.wikipedia.org/wiki/Biodiversity%7Chttps://en.m.wikipedia.org/wiki/Biodiversity%3E for a school presentation (10 years old kids)

Correct tool:

save_url_as_file

Model's selection:

web_search

Verdict:

False
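Because the user supplied a concrete URL, a correct response would fetch that page rather than search for it. Again, this call is hypothetical, as the benchmark's save_url_as_file signature isn't shown:

# Hypothetical call for illustration only.
save_url_as_file(url="https://en.m.wikipedia.org/wiki/Biodiversity")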

Function Argument Inaccuracy: Normal Distribution Peak Calculation

User message:

Calculate using python:
Given a normal distribution which spans from 0 to 8, and the total volume is 26, what is the highest peak? given these are 0.5 wide?

Model:

Tool: execute_python_code

import numpy as np
import scipy.stats as stats

# Define the parameters
start = 0
end = 8
volume = 26
width = 0.5

# Calculate the area under the curve
area = volume / width

# Calculate the standard deviation
std_dev = (end - start) / 4

# Calculate the peak value
peak = stats.norm.pdf(0, loc=(end-start)/2, scale=std_dev)
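The arguments above are inaccurate in at least two ways: the pdf is evaluated at x=0 rather than at the distribution's mean, and the stated volume of 26 never enters the result. The benchmark's ground-truth answer isn't shown, but a corrected sketch that keeps the model's own assumptions (mean at the center of the 0 to 8 range, sigma of one quarter of the range) could look like this:

import scipy.stats as stats

# Same parameters as the model's attempt.
start, end, volume, width = 0, 8, 26, 0.5

mu = (start + end) / 2        # a normal distribution peaks at its mean
std_dev = (end - start) / 4   # keep the model's own sigma assumption

# Scale the unit pdf so the total area equals the stated volume of 26,
# then evaluate it at the mean, where the density is highest.
peak_density = volume * stats.norm.pdf(mu, loc=mu, scale=std_dev)

# Approximate height of the tallest 0.5-wide bar: density times bin width.
peak_bar = peak_density * width

print(peak_density, peak_bar)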

Have a unique use case you’d like to test?

We want to evaluate how LLMs perform on your specific, real-world task. You might discover that a small, open-source model delivers the performance you need at a better cost than proprietary models. We can also add custom filters to sharpen your insight into LLM capabilities. Each time a new model is released, we'll provide you with updated performance results.

[Leaderboard screenshot: an open-source model beating GPT-4 Turbo on our interactive leaderboard.]


Please briefly describe your use case and motivation. We’ll get back to you with details on how we can add your benchmark.