LLM Benchmarks
With new models arriving constantly, benchmarks and evals serve as a north star to showcase the progress of newly released models.
I have created a benchmark aggregator to get all the major benchmarks under one roof.
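To show what that involves under the hood, here is a minimal sketch of what a benchmark aggregator boils down to: per-benchmark score tables merged into a single per-model view. The model names and scores below are made-up placeholders, not numbers from the actual aggregator.

```python
# Minimal sketch of a benchmark aggregator (hypothetical models and scores).
# Each benchmark source reports {model_name: score}; we merge them into one table.

BENCHMARKS = {
    "GPQA Diamond": {"model-a": 71.2, "model-b": 68.4},
    "SWE-bench Pro": {"model-a": 23.1, "model-b": 28.7},
    "MMLU-Pro": {"model-a": 80.5, "model-b": 77.9},
}

def aggregate(benchmarks):
    """Pivot {benchmark: {model: score}} into {model: {benchmark: score}}."""
    table = {}
    for bench, scores in benchmarks.items():
        for model, score in scores.items():
            table.setdefault(model, {})[bench] = score
    return table

if __name__ == "__main__":
    for model, scores in aggregate(BENCHMARKS).items():
        row = ", ".join(f"{b}: {s}" for b, s in scores.items())
        print(f"{model}: {row}")
```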

China vs USA in AI
DeepSeek's release in December 2024 was a pivotal moment for Chinese AI. A lot of people still think it is the only good Chinese model, but there are now so many of them at such great prices that US models just might lose the price/performance war.
Kimi K2
Developed by Moonshot.ai. Try it here: Kimi Chat
It is one of the best general-purpose models. It provides great multi-modal features such as creating presentations and slides.
Its slide generation is powered by Nano Banana and is, in my opinion, the best infographic slide-making tool out there.
It spawns agents to aggregate information from the internet and builds a comprehensive outline.

Z.ai’s GLM
Try it here: https://chat.z.ai/
It is a dirt-cheap model with good performance on coding tasks. Many people use it as a coding assistant whilst using something like GPT-5.2 or Opus/Sonnet 4.5 for orchestration.
Performance numbers are on par with SOTA, especially for coding tasks.
Real-world usage might differ, but the chatter on X/Twitter suggests it is a great model for low-hanging fruit considering the cost.
Pricing:

The base plan for Z.ai's GLM is $3/month, whilst for ChatGPT or Claude it is $20/month!
Minimax
Try it here: https://agent.minimax.io/
New kid on the block that is rising in the benchmarks. It is great for long-horizon tasks and is cheaper than Anthropic's Claude Opus and Sonnet.
It also has great multi-modal capabilities in:
- Text
- Video
- Music
- Speech
- Agent
Benchmarks
Knowledge
1. Humanity's Last Exam
Consists of questions across physics, medicine, humanities, computer science, engineering, etc.
2. GPQA Diamond
Multiple-choice questions in STEM fields, written by PhD candidates in the relevant fields.
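To make it concrete how multiple-choice benchmarks like GPQA Diamond (and MMLU-Pro below) are typically scored, here is a rough sketch: prompt the model, extract the chosen letter, and compare it against the gold answer. The toy question and the ask_model stub are placeholders, not part of any official evaluation harness.

```python
import re

# Toy item standing in for a GPQA-style question (the real ones are much harder).
QUESTIONS = [
    {"q": "Which particle mediates the electromagnetic force?",
     "choices": {"A": "Photon", "B": "Gluon", "C": "W boson", "D": "Graviton"},
     "answer": "A"},
]

def ask_model(prompt):
    """Placeholder for a real model call; always answers 'A' here."""
    return "The correct answer is (A)."

def extract_choice(text):
    """Pull the first standalone A-D letter out of the model's reply."""
    match = re.search(r"\b([A-D])\b", text)
    return match.group(1) if match else None

def evaluate(questions):
    correct = 0
    for item in questions:
        options = "\n".join(f"({k}) {v}" for k, v in item["choices"].items())
        reply = ask_model(f"{item['q']}\n{options}\nAnswer with a single letter.")
        if extract_choice(reply) == item["answer"]:
            correct += 1
    return correct / len(questions)

print(f"Accuracy: {evaluate(QUESTIONS):.0%}")
```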
Coding
1. SWE Bench Pro
Realistic evaluation of AI agents for software engineering. More sophisticated than the original SWE-bench, with harder tasks that use private repos and overfitting checks.
2. Terminal Bench
Tests AI agents in terminals. These are coding tasks that take place inside a terminal with TUI/CLI tools such as Factory, Codex, etc.

Visual and Language Reasoning
1. MMLU-Pro
A benchmark with 12k questions across 14 subject areas. These are graduate-level questions, so reasoning is required to get the correct answer.
2. MMMU Pro
This is a multi-modal benchmark that takes both visual and textual reasoning into account.
One setting requires answering questions presented as images, making it more rigorous for real-world applications.
Benchmaxxing
In statistics and machine learning, any model runs the risk of overfitting. In simple terms, it fits the training data perfectly, but when it sees new data, its performance degrades rapidly.
LLMs have been notorious for being optimized to boost their benchmark numbers while failing spectacularly in the real world.
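As a toy, non-LLM illustration of overfitting, the sketch below fits a low-degree and a high-degree polynomial to a handful of noisy points: the high-degree fit typically nails the training data almost perfectly while doing worse on held-out data. Benchmark contamination in LLMs is the analogous failure, just driven by training data rather than polynomial degree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = sin(x) plus noise, with a small training set and a held-out test set.
x_train = np.linspace(0, 3, 8)
x_test = np.linspace(0.2, 2.8, 50)
y_train = np.sin(x_train) + rng.normal(0, 0.2, x_train.size)
y_test = np.sin(x_test) + rng.normal(0, 0.2, x_test.size)

for degree in (2, 7):
    coeffs = np.polyfit(x_train, y_train, degree)            # fit polynomial
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```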
Llama 4 Maverick Benchmaxxing
When Llama 4 arrived, it crushed the benchmarks and was placed #2 in the LMSYS Chatbot Arena. The arena is a human evaluation where two models are pitted against each other and the user decides which response is better. Meta submitted a specific version tuned to use more emojis and be more polite. When the actual release version was tested, it dropped from #2 to #38.
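For intuition, arena-style ranking boils down to rating updates over pairwise human votes. Below is a simplified Elo-style sketch; the real leaderboard uses a more careful statistical fit, and the model names and votes here are made up.

```python
# Simplified Elo-style rating from pairwise votes, in the spirit of a chatbot arena.

K = 32  # update step size

def expected(r_a, r_b):
    """Probability that the first model wins under the Elo assumption."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(ratings, winner, loser):
    """Move rating points from the loser to the winner of one vote."""
    e = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e)
    ratings[loser] -= K * (1 - e)

ratings = {"model-a": 1000.0, "model-b": 1000.0, "model-c": 1000.0}
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c"),
         ("model-a", "model-b"), ("model-c", "model-b")]

for winner, loser in votes:
    update(ratings, winner, loser)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.0f}")
```

A gamed submission, like the Llama 4 case above, climbs these ratings by winning more of the human votes, not by being better at the underlying tasks.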
Qwen
Qwen has been one of those models that under-delivers. Yes, the models are small, cheap and, to an extent, useful, but they are nowhere near state-of-the-art. Still, it is important for Qwen to exist, as it is very light and can easily run on local machines (including mobiles).
Closing thoughts
DeepSeek was just the start, and Chinese AI labs are building models that are both cheaper and packed with useful features.