Description
Evaluates the ability of models to interpret and understand food images from delivery applications through two distinct tasks: detailed caption generation and image quality assessment. Models must produce accurate and detailed captions describing dishes, ingredients, and presentation without hallucinating information. Additionally, they must identify relevant image-quality issues from a standardized set of labels, such as text overlays, human presence, unappealing presentation, or other similar factors.
Provider
iFood and Glovo
Language
English
Evaluation
Generated captions are scored automatically by GPT-4o against reference captions. The F1 score measures the model's ability to correctly label the issues present in each image.
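As an illustration of the F1 part of the evaluation, here is a minimal sketch of how a multi-label F1 score could be computed over issue labels. The benchmark does not state which averaging scheme it uses, so micro-averaging is an assumption here. The function name `micro_f1` and the label strings are also hypothetical, chosen only to mirror the issue types named in the description.

```python
from typing import List, Set


def micro_f1(predicted: List[Set[str]], reference: List[Set[str]]) -> float:
    """Micro-averaged F1 over per-image label sets.

    Assumption: the benchmark's exact averaging scheme is not documented;
    micro-averaging is used here purely for illustration.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")

    tp = fp = fn = 0
    for pred, ref in zip(predicted, reference):
        tp += len(pred & ref)   # labels correctly predicted
        fp += len(pred - ref)   # labels predicted but not in the reference
        fn += len(ref - pred)   # reference labels the model missed

    denom = 2 * tp + fp + fn
    # Convention chosen for this sketch: no labels anywhere counts as perfect agreement.
    return 2 * tp / denom if denom else 1.0


# Hypothetical label names, one set of labels per image.
reference = [{"text_overlay"}, {"human_presence", "unappealing_presentation"}, set()]
predicted = [{"text_overlay"}, {"human_presence"}, {"text_overlay"}]

score = micro_f1(predicted, reference)
print(f"micro F1 = {score:.3f}")
```

In this example the model gets two labels right, adds one spurious label, and misses one, so precision and recall are both 2/3. Micro-averaging pools counts across all images, which weights frequent issue types more heavily than a per-label (macro) average would.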
Data Statistics
Number of Samples: 149
Collection Period: March 2025


Have a unique use-case you’d like to test?

We want to evaluate how LLMs perform on your specific, real-world task. You might discover that a small, open-source model delivers the performance you need at a lower cost than proprietary models. We can also add custom filters to give you more insight into LLM capabilities. Each time a new model is released, we'll provide you with updated performance results.
