Function Calling
Description
Evaluates an LLM's ability to accurately use defined functions to perform specific tasks, such as web searches, code execution, and planning multiple function calls. Input data is a conversation history and a list of possible tools to use.
Provider
Prosus
Language
English
Evaluation
Multi-class classification accuracy using human-labeled data, and auto-evaluation with GPT-4 Turbo against the ground truth.
Data Statistics
Number of Samples: 788
Collection Period: January 2024 - May 2024
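As a rough illustration of the function-selection part of the evaluation, the sketch below computes multi-class classification accuracy over labeled samples. The `Sample` schema, the `function_accuracy` helper, and the two sample messages are assumptions made up for this example; they are not the benchmark's actual data format or scoring code.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One labeled evaluation item (hypothetical schema, not the benchmark's own)."""
    messages: list[str]   # conversation history
    tools: list[str]      # names of the tools offered to the model
    expected_tool: str    # human-labeled correct tool


def function_accuracy(samples: list[Sample], predictions: list[str]) -> float:
    """Return the fraction of samples whose predicted tool matches the label."""
    if len(samples) != len(predictions):
        raise ValueError("samples and predictions must have the same length")
    if not samples:
        return 0.0
    correct = sum(s.expected_tool == p for s, p in zip(samples, predictions))
    return correct / len(samples)


# Invented samples for illustration only; they are not taken from the dataset.
tools = ["web_search", "execute_python_code"]
samples = [
    Sample(["Find recent news about solar power."], tools, "web_search"),
    Sample(["Compute 2**10 with python."], tools, "execute_python_code"),
]
predictions = ["web_search", "web_search"]  # hypothetical model selections

print(function_accuracy(samples, predictions))  # 0.5
```

Argument correctness is not covered here, since the page describes it only as an auto-evaluation with GPT-4 Turbo and gives no scoring rule.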
Function
Function Types the models were tested on.
Inference Method
Approach to querying the model for function use.
# | Model | Provider | Size | Inference Method | Function Accuracy | Argument Correctness |
---|---|---|---|---|---|---|
Examples
Function Misselection: Analyzing Olympic Sports Over Time
User message:
How many sports have been in the Olympics? How has this developed over time? Can you create a chart that illustrates this change over time clearly?
Correct tool:
web_search
Model's selection:
execute_python_code
Verdict:
False
Function Misselection: URL Handling for Biodiversity Presentation
User message:
Suggest 3 slides about https://en.m.wikipedia.org/wiki/Biodiversity%7Chttps://en.m.wikipedia.org/wiki/Biodiversity%3E for a school presentation (10 years old kids)
Correct tool:
save_url_as_file
Model's selection:
web_search
Verdict:
False
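The two misselection cases above can be written as (correct tool, model's selection) pairs and scored mechanically. The sketch below is one possible way to do that, also tallying which tool was mistaken for which. The `verdict` and `confusions` helpers are hypothetical names introduced for this example, not part of the benchmark.

```python
from collections import Counter

# (correct tool, model's selection) pairs copied from the two examples above.
cases = [
    ("web_search", "execute_python_code"),
    ("save_url_as_file", "web_search"),
]


def verdict(correct_tool: str, selected_tool: str) -> bool:
    """True only when the model selected the labeled tool."""
    return correct_tool == selected_tool


def confusions(pairs: list[tuple[str, str]]) -> Counter:
    """Count each (correct, selected) pair in which the model chose wrongly."""
    return Counter((c, s) for c, s in pairs if not verdict(c, s))


verdicts = [verdict(c, s) for c, s in cases]
print(verdicts)  # [False, False]

for (correct, selected), count in confusions(cases).items():
    print(f"{correct} mistaken for {selected}: {count}")
```

A tally like this complements a single accuracy figure, because it shows which pairs of tools a model tends to confuse.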
Function Argument Inaccuracy: Normal Distribution Peak Calculation
User message:
Calculate using python:
Given a normal distribution which spans from 0 to 8, and the total volume is 26, what is the highest peak? given these are 0.5 wide?
Model:
Tool: execute_python_code
import numpy as np
import scipy.stats as stats
# Define the parameters
start = 0
end = 8
volume = 26
width = 0.5
# Calculate the area under the curve
area = volume / width
# Calculate the standard deviation
std_dev = (end - start) / 4
# Calculate the peak value
peak = stats.norm.pdf(0, loc=(end-start)/2, scale=std_dev)