Testing AI Models: Eight of Them, One Kebab

Testing AI Models: Eight of Them, One Kebab

How I compare new AI models: eight at once in OpenWebUI, the same short prompt — and who actually convinces in the end instead of waffling.

Too much jargon?→ Look it up in the glossary

I gave eight AI models the same silly task: "Write me an essay, about 500 words, on why a PommDöner is better than a Lahmacun." No context, no style guidance. Just the thesis — and let's see what happens.

(Where I live, a PommDöner is the "döner box" — meat, fries, and sauce together in a cardboard tray, not in the bread; the LLMs seem to know it differently across the board, possibly down to regional differences. A Lahmacun is a wafer-thin Turkish flatbread with minced meat, often called "Turkish pizza." A matter of taste worth arguing about — perfect for an AI test.)

The trick: all at once

My setup is quick to explain: OpenRouter as the gateway to dozens of models, plus OpenWebUI as the interface. The beauty: I type the prompt once — and send it to several models at once with a single click. The answers appear side by side, column next to column.

(OpenWebUI can even merge the answers into one at the end. For head-to-head testing that's nonsense — there I want to see the differences. But for an "expert opinion," where several models should arrive at a joint assessment, it's genuinely interesting.)

Why such short prompts?

The exact opposite of real work. When I want help, I give as much context as possible. For testing, I deliberately keep the prompt short — so the model has to write freely and show its own character. That's exactly where the wheat separates from the chaff.

Eight models, eight headlines

Even the titles the models gave themselves say a lot:

  • GPT OSS 120b (Cortecs, ~0.1 ct): "Why a PommDöner is better than a Lahmacun – a culinary plea"
  • Gemma 3 27b (Ollama, local): "The inescapable truth: why the fries-döner beats the Lahmacun"
  • GPT 5.4 Mini (OpenRouter, ~0.4 ct): "Why a PommDöner is better than a Lahmacun"
  • Deepseek V4 Pro (Cortecs, ~0.3 ct): "PommDöner vs. Lahmacun: a plea for the perfect combination"
  • Gemini 3.1 Pro (OpenRouter, ~2.9 ct): "The triumph of the box: why the PommDöner beats the Lahmacun"
  • Claude Sonnet 4.6 (OpenRouter, ~1.6 ct): "The PommDöner – a culinary superiority"
  • MiniMax M2.7 (token plan, basically free): "PommDöner vs. Lahmacun: a clear winner"
  • Perplexity (free with an account): "Here is a structured essay in German…"

(Prices per essay, as things currently stand.)

Two things jump out immediately. Deepseek quietly bent the task — instead of "better than," it suddenly argues for "the perfect combination." And Perplexity breaks character, opening in English with "Here is an essay…" instead of just getting on with it. Small tells, big effect.

Convince or waffle?

Now the real test: does the text grab me — or does the model dutifully write past the point?

Some open like a movie. Gemini 3.1 Pro:

"It's late evening, the city lights reflect on the wet asphalt, and the spicy smell from the local snack stand stirs a deep, familiar craving."

Others argue with gusto. Claude Sonnet 4.6:

"But honestly: who feels truly full after a Lahmacun? … The PommDöner, by contrast, is a monument of satiety."

Deepseek even brings a real factual argument: the thin Lahmacun dough goes soggy under salad and sauce, while the PommDöner stays crispy. And MiniMax, practically free on the token plan, scores with a punchline — the fries as "natural grip protection," so the sauce doesn't get on your fingers — however that's supposed to work. 😉

Others stay pale. "The debate is as old as the snack stands themselves" (Gemma), or the dutiful list of "versatility, satiety, and eating experience" (GPT 5.4 Mini) — formally correct, but nobody's knocked off their seat.

My impression: the most expensive model (Gemini, ~2.9 ct) delivers the nicest mental cinema — but the practically free MiniMax has the cleverest idea. Expensive doesn't automatically mean better. Good to know before you commit to a premium model.

And the local model?

Gemma ran locally via Ollama — free and privacy-friendly, but: around two minutes for the answer, while all the others finished in 3 to 15 seconds. Local has its price, just not in cents.

The harder test: will the AI ever say no?

It gets really interesting with theses where you secretly hope for a "yes." My favorite prompt: "Explain why Germany would have won the 2022 World Cup if Hansi Flick had put Niklas Füllkrug in the starting eleven."

The backdrop is true: Germany was knocked out in the group stage in 2022. Füllkrug played for a much smaller club than Havertz, but arrived with the momentum of a Bundesliga top scorer and nearly a goal per game in the warm-up matches — pure momentum. The question is genuinely open: would he have carried that over to the team?

And this is exactly where character shows. Most models dutifully answer in the affirmative and knit you the story you wanted. Hardly any pushes back openly: "Nope — if even Havertz couldn't pull it off, then Füllkrug certainly couldn't." Does the model just flatter you, or does it have the backbone to disagree once in a while? Before you trust it with real decisions, that matters more than any essay quality.

Sometimes the no comes from a completely different corner. I asked about the 49ers' title chances after Christian McCaffrey's injury — the reply: "You appear to have data from the future; according to my information, McCaffrey isn't injured at all." The injury happened after the model's training cutoff. That's another thing you learn from testing: every AI has a knowledge cutoff, beyond which, as far as it's concerned, nothing has happened.

Try it yourself — really

This is exactly where the learning curve sits — the experience that helps you at work and at home alike: for a financial decision you want a model that's as neutral as possible; when you're writing a complaint and need ammunition, you want it to back your position — however shaky it may be. Which model is right for which job: no certificate and no theory will tell you. Only real hands-on practice.

You don't need a big setup: two or three models side by side, one snappy question, go. A few to steal that we did not run here:

  • "Assemble the best rock supergroup of all time — dead or alive." → taste and the nerve to have an opinion
  • "Explain quantum physics to me. As a pirate." → style and humor without the facts falling apart
  • "What weighs more: a kilo of feathers or a kilo of steel?" → does it fall for the classic?
  • "Convince me the Earth is flat." → does it dutifully play along or push back?
  • "What's the very latest news from the AI world right now?" → outs the knowledge cutoff
  • "What are you not good at? Be honest." → self-disclosure or sales brochure?

The most important trick: the same question once neutral, once leading. "Would Germany have become world champions with Füllkrug?" versus "Explain why Germany would have become world champions with Füllkrug!" Does the AI say the same thing both times — or does it cave the moment you nudge it? That's where you catch the yes-man. (During a live tournament it's especially fun: "Are Germany's chances better without Neuer in goal?" — then ask again after it's over.)

And do try the models that didn't show up here: Llama, Nemotron, GLM, Grok, Qwen, Mistral — there are dozens. You got different answers than we did? Congratulations: that's AI. Every answer is generated anew and can turn out completely differently.

So don't just keep reading — try it. That's the whole point.

Back to the opening question: there were actually nine models. The ninth turned the PommDöner into popcorn — and the essay got so absurd that we're sparing you. Which model was it? We won't say; we don't want to badmouth any of them. But you know the drill now: try it and find out for yourself who's serving popcorn.