
Benchmarking LLMs for business workloads

I've been benchmarking LLMs on business workloads for quite some time.

(Figure: LLM benchmark results, October 2024)

These benchmarks are based on a private collection of prompts and tests extracted from real products and AI cases. They don't test how well LLMs talk, but how accurately they accomplish tasks relevant to business process automation.
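To give you an idea of how such a test works: each benchmark item pairs a prompt with a deterministic check, and a model's score is simply the fraction of checks it passes. The sketch below illustrates the principle only; it is a simplified stand-in, not the actual harness, and the sample tasks and helper names (`Eval`, `score`) are invented for this example.

```python
# Minimal sketch of a task-accuracy check. This is NOT the real benchmark
# harness; the Eval structure, sample tasks, and score() helper are made up
# to illustrate the idea of deterministic pass/fail scoring.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Eval:
    prompt: str                   # task extracted from a real workflow
    check: Callable[[str], bool]  # deterministic pass/fail on the model reply

# Hypothetical examples of business-style tasks.
EVALS = [
    Eval(
        prompt=(
            "Extract the total from: 'Invoice #42, total due: $1,305.20'. "
            "Reply with the number only."
        ),
        check=lambda reply: reply.strip().lstrip("$") == "1,305.20",
    ),
    Eval(
        prompt=(
            "Classify the ticket 'My payment was charged twice' as one of: "
            "billing, technical, other. Reply with one word."
        ),
        check=lambda reply: reply.strip().lower() == "billing",
    ),
]

def score(call_model: Callable[[str], str]) -> float:
    """Run every eval through call_model and return the pass rate."""
    passed = sum(e.check(call_model(e.prompt)) for e in EVALS)
    return passed / len(EVALS)

if __name__ == "__main__":
    # Plug in any model client here; a canned stub keeps the sketch runnable.
    canned = {e.prompt: reply for e, reply in zip(EVALS, ["1,305.20", "billing"])}
    print(f"accuracy: {score(lambda p: canned[p]):.0%}")  # -> accuracy: 100%
```

The real collection follows the same pass/fail principle, just with far more tasks drawn from real products.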

My findings are documented in a series of monthly reports published on the TimeToAct and Trustbit websites.

  • September 2024 - ChatGPT o1, Gemini 1.5 Pro 002, Qwen 2.5, Llama 3.2, Local LLM trends over time
  • August 2024 - Enterprise RAG Challenge
  • July 2024 - Codestral Mamba 7B, GPT-4o Mini, Meta Llama 3.1, Mistral
  • June 2024 - Claude 3.5 Sonnet, Confidential Computing, Local LLM trends
  • May 2024 - Gemini 1.5 0514, GPT-4o, Qwen 1.5, IBM Granite
  • April 2024 - Gemini Pro 1.5, Command-R, GPT-4 Turbo, Llama 3, Long-term trends
  • March 2024 - Anthropic Claude 3 models, Gemini Pro 1.0
  • February 2024 - GPT-4 0125, Anthropic Claude v2.1, Mistral flavours
  • January 2024 - Mistral 7B OpenChat v3
  • December 2023 - Multilingual benchmark, Starling 7B, Notus 7B and Microsoft Orca
  • November 2023 - GPT-4 Turbo, GPT-3.5 Turbo
  • October 2023 - New Evals, Mistral 7B
  • September 2023 - Nous Hermes 70B
  • August 2023 - Anthropic Claude v2, Llama 2, GPT-4 0613
  • July 2023 - GPT-4, Anthropic Claude, Vicuna 33B, Luminous Extended

Published: October 17, 2024.

🤗 I also write a newsletter about building products with ChatGPT and LLMs: latest news, technical insights, and my journey. Check it out!