Description
Evaluates an LLM's ability to accurately use defined functions to perform specific tasks, such as web search, code execution, and planning multiple function calls. Each input consists of a conversation history and a list of available tools.
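For illustration, one input sample might look like the sketch below. The field names are assumptions, not the benchmark's published schema; the tool names are taken from the examples further down this page.

# Sketch of one input sample; field names are illustrative only.
sample = {
    "messages": [
        {"role": "user", "content": "How many sports have been in the Olympics?"}
    ],
    "tools": [
        {"name": "web_search", "description": "Search the web for up-to-date information."},
        {"name": "execute_python_code", "description": "Run a Python snippet and return its output."},
        {"name": "save_url_as_file", "description": "Download the contents of a URL as a file."}
    ]
}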
Provider
Prosus
Language
English
Evaluation
Multi-class classification accuracy against human-labeled data, plus auto-evaluation with GPT-4 Turbo against the ground truth.
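As an illustrative sketch (not the benchmark's actual implementation), the function-selection half of this scoring reduces to exact-match accuracy over the labeled samples:

def function_accuracy(samples):
    # Fraction of samples where the model selected the ground-truth tool.
    correct = sum(1 for s in samples if s["model_selection"] == s["correct_tool"])
    return correct / len(samples)

# The two misselection examples below would both count as misses, scoring 0.0:
samples = [
    {"correct_tool": "web_search", "model_selection": "execute_python_code"},
    {"correct_tool": "save_url_as_file", "model_selection": "web_search"},
]
print(function_accuracy(samples))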
Data Statistics
Number of Samples
788
Collection Period
January 2024 – May 2024
Function
Function types the models were tested on.
Inference Method
Approach to querying the model for function use.

[Interactive leaderboard table: columns are Model, Provider, Size, Inference Method, Function Accuracy, and Argument Correctness.]

Examples

Function Misselection: Analyzing Olympic Sports Over Time

User message:

How many sports have been in the Olympics? How has this developed over time? Can you create a chart that illustrates this change over time clearly?

Correct tool:

web_search

Model's selection:

execute_python_code

Verdict:

False
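A correct response would start by retrieving the historical data, for example with a call along these lines. The query argument is hypothetical, since the benchmark's exact web_search signature isn't shown:

# Hypothetical call for illustration only.
web_search(query="number of sports in the Summer Olympics by year")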

Function Misselection: URL Handling for Biodiversity Presentation

User message:

Suggest 3 slides about https://en.m.wikipedia.org/wiki/Biodiversity%7Chttps://en.m.wikipedia.org/wiki/Biodiversity%3E for a school presentation (10 years old kids)

Correct tool:

save_url_as_file

Model's selection:

web_search

Verdict:

False
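Because the user supplied a concrete URL, a correct response would fetch that page rather than search for it. Again, this call is hypothetical, as the benchmark's save_url_as_file signature isn't shown:

# Hypothetical call for illustration only.
save_url_as_file(url="https://en.m.wikipedia.org/wiki/Biodiversity")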

Function Argument Inaccuracy: Normal Distribution Peak Calculation

User message:

Calculate using python:
Given a normal distribution which spans from 0 to 8, and the total volume is 26, what is the highest peak? given these are 0.5 wide?

Model:

Tool: execute_python_code

import numpy as np
import scipy.stats as stats

# Define the parameters
start = 0
end = 8
volume = 26
width = 0.5

# Calculate the area under the curve
area = volume / width

# Calculate the standard deviation
std_dev = (end - start) / 4

# Calculate the peak value
peak = stats.norm.pdf(0, loc=(end-start)/2, scale=std_dev)
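The arguments above are inaccurate in at least two ways: the pdf is evaluated at x=0 rather than at the distribution's mean, and the stated volume of 26 never enters the result. The benchmark's ground-truth answer isn't shown, but a corrected sketch that keeps the model's own assumptions (mean at the center of the 0 to 8 range, sigma of one quarter of the range) could look like this:

import scipy.stats as stats

# Same parameters as the model's attempt.
start, end, volume, width = 0, 8, 26, 0.5

mu = (start + end) / 2        # a normal distribution peaks at its mean
std_dev = (end - start) / 4   # keep the model's own sigma assumption

# Scale the unit pdf so the total area equals the stated volume of 26,
# then evaluate it at the mean, where the density is highest.
peak_density = volume * stats.norm.pdf(mu, loc=mu, scale=std_dev)

# Approximate height of the tallest 0.5-wide bar: density times bin width.
peak_bar = peak_density * width

print(peak_density, peak_bar)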

Have a unique use case you’d like to test?

We want to evaluate how LLMs perform on your specific, real-world task. You might discover that a small, open-source model delivers the performance you need at a better cost than proprietary models. We can also add custom filters to sharpen your insight into LLM capabilities. Each time a new model is released, we'll provide you with updated performance results.

[Leaderboard screenshot: an open-source model beating GPT-4 Turbo on our interactive leaderboard.]


Please briefly describe your use case and motivation. We’ll get back to you with details on how we can add your benchmark.