Comparison · February 2, 2026 · 13 min read

Long Context LLMs Compared: Which Can Handle 1M Tokens?

Compare Claude, GPT-4, Gemini, and open-source models on long context handling. Benchmarks and real tests.

llm · context window · claude · gemini · gpt-4

Molted Team

Molted.cloud

Context windows. Two years ago, nobody cared. Now it's the first spec people check when picking a model. And honestly? Most people get it wrong.

Here's my take after throwing everything from legal docs to entire codebases at these models: bigger isn't always better. Gemini's 1M tokens is mostly marketing BS for 90% of use cases. There, I said it.

Let me show you what actually matters.

Tokens 101 (skip if you know this)

Quick primer because I keep seeing people confuse tokens with words. A token is a chunk of text the model digests. Not a word. Not a character. A chunk. The relationship is messy.

For English:

  • 1 token ~ 0.75 words (or 4 characters)
  • 100K tokens ~ 75,000 words (a thick novel)
  • 1M tokens ~ 750,000 words (roughly the length of the King James Bible)

Code eats more tokens than prose. Variable names, brackets, whitespace - it adds up fast. That 10,000-line codebase? Could be anywhere from 80K to 150K tokens depending on how verbose the devs were.

Each provider has their own tokenizer. Same text, different counts. For quick math, divide characters by 4. For billing accuracy, use their actual tokenizer. Getting this wrong at scale = surprise invoices.
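The quick math above is easy to sketch. This is just the chars-divided-by-4 heuristic from this section, not a real tokenizer; for billing accuracy you'd use each provider's own tokenizer library (e.g. OpenAI's tiktoken):

```python
# Rough token estimation using the chars/4 heuristic described above.
# A real tokenizer will disagree, especially on code, which eats more
# tokens per line than English prose.

def estimate_tokens(text: str) -> int:
    """Approximate token count: ~4 characters per token for English."""
    return max(1, len(text) // 4)

def estimate_words(tokens: int) -> float:
    """Approximate word count: ~0.75 words per token."""
    return tokens * 0.75
```

Good enough for sizing a prompt before you send it; not good enough for invoicing.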

Claude: 200K tokens (the sweet spot)

I'll be upfront - Claude is my daily driver. Not because Anthropic pays me (they don't), but because 200K tokens hits this perfect middle ground where you can actually do useful work without the latency penalty of Gemini's 1M.

What does 200K get you? About 500 pages of text. An entire medium-sized codebase. Months of chat history. Enough to be genuinely useful without making you wait 45 seconds for a response.

Where Claude actually shines

The first 100K tokens? Rock solid. Almost no degradation. Push past 150K and yeah, you'll notice some fuzzy recall on specific details. But here's the thing - Claude follows instructions consistently even at max context. You can set up complex guidelines at the start, dump 150K tokens of content, and it'll actually apply your rules throughout. Try that with GPT-4.

The cost reality

Model | Input (per 1M tokens) | Output (per 1M tokens)
Claude 3.5 Sonnet | $3 | $15
Claude 3.5 Haiku | $0.25 | $1.25
Claude Opus 4 | $15 | $75

Full 200K request with Sonnet = $0.60 input cost. Not cheap if you're doing it repeatedly. Pro tip: use prompt caching. Cuts repeated context costs by 90%. If you're analyzing the same doc multiple times without caching, you're lighting money on fire.
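Here's the back-of-the-envelope version of that math. The 90% cache discount is the rough figure from this post, not Anthropic's official rate card, so treat the cached number as a sketch:

```python
# Cost math for the Sonnet pricing above. CACHE_DISCOUNT is the
# approximate ~90% savings mentioned in the text, not an official rate.

SONNET_INPUT_PER_1M = 3.00   # USD per 1M input tokens (table above)
CACHE_DISCOUNT = 0.90        # assumed savings on cached (repeated) context

def input_cost(tokens: int, cached_tokens: int = 0) -> float:
    """Cost of one request; cached_tokens get the assumed cache discount."""
    full_rate = SONNET_INPUT_PER_1M / 1_000_000
    fresh = tokens - cached_tokens
    return fresh * full_rate + cached_tokens * full_rate * (1 - CACHE_DISCOUNT)

print(round(input_cost(200_000), 2))             # full 200K, uncached: 0.6
print(round(input_cost(200_000, 190_000), 3))    # same request, 190K cached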

The subscription hack: Got Claude Pro or Max? You can use your subscription token instead of paying API rates. Same model, same capabilities, but you're paying your flat $20-100/month instead of variable API costs. For heavy users, this can save hundreds per month.

Best for

  • Codebase refactoring (this is where Claude destroys the competition)
  • Legal doc review
  • Research synthesis
  • AI assistants that need memory

Use Claude daily?

OpenClaw gives you Claude on WhatsApp, Discord, Telegram. Hosted by Molted.

Start free trial

GPT-4 Turbo: 128K tokens (good enough for most)

OpenAI's workhorse. 128K tokens - roughly 300 pages. Less than Claude, but let's be real: most tasks don't need 200K anyway.

The dirty secret

Early GPT-4 Turbo had a serious "lost in the middle" problem. Information in the center of your context? Good luck retrieving it. They've improved this, but it's still... not great. If you need something from page 150 of 300, flip a coin on whether you'll get it.

Where GPT-4 wins is synthesis. Ask it to summarize patterns across your whole context and it performs. Ask it to find a specific sentence? Much less reliable.

Pricing

Model | Input (per 1M tokens) | Output (per 1M tokens)
GPT-4 Turbo | $10 | $30
GPT-4o | $2.50 | $10
GPT-4o mini | $0.15 | $0.60

Hot take: just use GPT-4o. Same context window, way cheaper. Unless you have a specific reason to need Turbo (I honestly can't think of one anymore), 4o is the play.

Best for

  • Doc summarization
  • Multi-turn conversations
  • When you need the OpenAI ecosystem (plugins, integrations)
  • Budget-conscious projects with GPT-4o

Gemini 1.5 Pro: 1M tokens (but do you need it?)

Okay, here's where I'll be controversial.

Google's 1M token context is legitimately impressive from a technical standpoint. You could feed it the entire Lord of the Rings trilogy plus all Harry Potter books and still have room for instructions. That's insane.

But when do you actually need that?

My experiment with a 500-page legal doc

I threw a 500-page contract at Gemini. What happened surprised me - not because it failed, but because it... worked? Kind of? The retrieval was solid. Found the clause I was looking for. But it took 52 seconds just to process. And the analysis wasn't noticeably better than Claude's on the same (chunked) content.

For most real-world tasks, you don't need 1M tokens. You need 100-200K. The extra 800K is nice for bragging rights and very specific use cases, but you're paying for it in latency.

Pricing (actually competitive)

Model | Input (per 1M tokens) | Output (per 1M tokens)
Gemini 1.5 Pro (under 128K) | $1.25 | $5
Gemini 1.5 Pro (over 128K) | $2.50 | $10

Credit where it's due: Google's pricing is good. Especially under 128K. If you're already in GCP, this might be your cheapest option.
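One gotcha worth modeling: as I read Google's published tiering, the whole prompt gets billed at the higher rate once it crosses 128K, not just the overflow. Verify against current pricing before you rely on it, but the shape is:

```python
# Sketch of Gemini 1.5 Pro's tiered input pricing from the table above.
# Assumption: the *entire* prompt is billed at the higher rate once it
# exceeds 128K tokens (my reading of Google's tiering; check current docs).

def gemini_input_cost(tokens: int) -> float:
    rate = 1.25 if tokens <= 128_000 else 2.50  # USD per 1M input tokens
    return tokens * rate / 1_000_000
```

Which means a 130K-token prompt costs more than double a 128K one, so trimming a prompt back under the threshold can be a real saving.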

When 1M actually matters

  • Analyzing entire repos (like, BIG repos)
  • Video/audio processing (Gemini handles this natively - that's actually cool)
  • Multi-document research where you genuinely need everything in context
  • Archival analysis across years of data

Open-source: Llama and Mistral (the honest assessment)

Let's talk about the elephant in the room: open-source long context is... not there yet.

Llama 3

Base model: 8K tokens. That's it. Community fine-tunes claim more, but official support? 8K. You can run extended versions, but they're inconsistent. If you need serious context length and want open-source, you're going to have a bad time.

Mistral

Better story here. Mistral Large does 32K reliably. Some variants claim 128K. In practice? Plan for 16-32K of reliable context. Anything beyond that is a gamble.

The VRAM reality check

Model | Context | Minimum VRAM
Llama 3 8B | 8K | 8GB
Llama 3 70B | 8K | 40GB
Mistral 7B | 32K | 16GB
Mixtral 8x7B | 32K | 48GB

Running long-context locally requires serious hardware. That 3090 you're proud of? Probably caps out at Mistral 7B with 32K context. For Claude-level context locally, you'd need multiple A100s. At that point, just pay for the API.
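If you want to sanity-check those numbers yourself, the two big costs are the model weights and the KV cache, and the KV cache grows linearly with context length. The architecture defaults below (layers, KV heads, head dim) are illustrative placeholders, not exact for any one model:

```python
# Rough VRAM estimator: fp16/bf16 weights plus KV cache. The default
# layer/head numbers are illustrative, not specific to any real model.

def vram_gb(params_b: float, context: int, layers: int = 32,
            kv_heads: int = 8, head_dim: int = 128,
            bytes_per_elem: int = 2) -> float:
    """Estimated GB for weights + KV cache at a given context length."""
    weights = params_b * 1e9 * bytes_per_elem
    # KV cache: 2 tensors (K and V) per layer, per token in context
    kv_cache = 2 * layers * kv_heads * head_dim * context * bytes_per_elem
    return (weights + kv_cache) / 1e9

# e.g. a 7B-class model at 32K context lands around 18 GB before overhead:
print(round(vram_gb(7, 32_768), 1))
```

Note how the KV cache alone is several GB at 32K context, which is why the same model that fits comfortably at 4K context suddenly doesn't at 32K.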

Real tests, real results (not benchmarks - actual work)

I hate benchmarks. They're gamed. Every model company cherry-picks tests that make them look good. So I ran my own tests on actual tasks I do regularly.

Test 1: 150-page contract analysis

Task: Find liability clauses, spot inconsistencies, summarize termination conditions.

Claude 3.5 Sonnet: Found all 7 clauses. Caught 2 inconsistencies between sections 4 and 12 that I'd missed on my first read. 45 seconds.

GPT-4 Turbo: Found 6 of 7 (missed one buried in an appendix). Got 1 of 2 inconsistencies. 38 seconds.

Gemini 1.5 Pro: Found all 7. Got both inconsistencies. But the termination summary was weirdly less detailed. 52 seconds.

Winner: Claude, barely. Gemini's retrieval was actually great, but Claude's analysis was sharper.

Test 2: 45K-line TypeScript codebase

Task: Security audit, architecture review, dead code detection.

Claude: Found 4 security issues including an SQL injection pattern that was... embarrassing. Solid architecture suggestions. Found 12 unused functions.

GPT-4 Turbo: Context exceeded at 120K tokens. Had to truncate. Found 3 issues, missed one in the utils. Suggestions were generic - felt like it was playing it safe.

Gemini: Handled full context. Found 4 issues. Good but less specific suggestions than Claude. Found 8 unused functions (missed some that Claude caught).

Winner: Claude for code. Not even close.

Test 3: 12 research papers (~180K tokens)

Task: Synthesize methodology findings, find contradictions, identify gaps.

Claude: Good synthesis. Found 3 methodological contradictions. The gap analysis was genuinely insightful - pointed me to a research direction I hadn't considered. Papers in the middle got slightly less attention (more on this later).

GPT-4 Turbo: Context exceeded. Couldn't even attempt it.

Gemini: Handled everything. Thorough. Found all contradictions. Gap analysis was solid but more obvious - things I'd already considered.

Winner: Tie between Claude and Gemini, for different reasons.

Long context AI assistant

OpenClaw with Claude 200K context. 24-hour free trial.

Try free for 24 hours

The "lost in the middle" problem (and why benchmarks lie)

Here's something the marketing pages don't mention: every LLM forgets stuff in the middle.

It's not a bug. It's fundamental to how these models work. Information at the start? Remembered well. Information at the end? Also good. Information at page 150 of 300? Good luck.

The actual numbers

Stanford and Google published research on this. I ran my own tests to verify. Here's what I found:

Model | Beginning (0-20%) | Middle (40-60%) | End (80-100%)
Claude 3.5 | 95% | 82% | 91%
GPT-4 Turbo | 93% | 71% | 89%
Gemini 1.5 Pro | 94% | 88% | 92%

Look at GPT-4's middle performance. 71%. That's a 22-point drop from the start. Gemini actually handles middle context best - probably some specific architectural work they did. Claude's solid but not as good as Gemini here.
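If you want to run this kind of test yourself, the setup is simple: bury a known fact ("needle") at a chosen depth in filler text, then ask the model to retrieve it. A minimal sketch of the prompt construction (the filler and needle here are placeholders, and the actual API call is up to you):

```python
# Build a "lost in the middle" probe: insert a known fact at a chosen
# depth in filler text, then ask the model to retrieve it and score recall.

def build_probe(filler_paragraphs: list, needle: str, depth: float) -> str:
    """Insert `needle` at `depth` (0.0 = start, 1.0 = end) of the context."""
    idx = int(depth * len(filler_paragraphs))
    parts = filler_paragraphs[:idx] + [needle] + filler_paragraphs[idx:]
    return "\n\n".join(parts)

filler = [f"Paragraph {i} of unrelated text." for i in range(100)]
prompt = build_probe(filler, "The vault code is 4417.", depth=0.5)
# Send `prompt` plus "What is the vault code?" to each model,
# sweep depth from 0.0 to 1.0, and plot recall vs. depth.
```

Sweep the depth parameter and you'll reproduce the U-shaped curve in the table above: strong at the edges, weakest in the middle.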

How to work around it

  • Front-load critical info. System prompts, key instructions, most important context = first.
  • Repeat important stuff. Put it at the start AND summarize at the end.
  • Chunk when possible. For retrieval tasks, sometimes smaller chunks with separate queries beats one massive context.
  • Consider RAG. If you're mostly retrieving specific facts (not synthesizing), RAG often beats long context. There, I said it.
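For the chunking advice, here's a minimal version: overlapping character windows sized with the rough 4-chars-per-token heuristic. A real pipeline would split on sentence or section boundaries instead of raw characters, but this shows the shape:

```python
# Minimal chunker for the "chunk when possible" advice: overlapping
# character windows sized by the rough 4-chars-per-token heuristic.
# Real pipelines split on sentence/section boundaries, not raw chars.

def chunk_text(text: str, max_tokens: int = 8_000, overlap_tokens: int = 200):
    """Yield overlapping character windows approximating a token budget."""
    chunk_chars = max_tokens * 4
    overlap_chars = overlap_tokens * 4
    step = chunk_chars - overlap_chars
    for start in range(0, len(text), step):
        yield text[start:start + chunk_chars]
```

The overlap matters: without it, a clause that straddles a chunk boundary is invisible to both chunks.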

My decision framework (actually useful)

Stop asking "which model has the biggest context." Start asking "what do I actually need."

Use Claude when:

  • You need 100-200K tokens
  • Code is involved (seriously, Claude is miles ahead here)
  • Following complex instructions matters
  • You want consistent, predictable outputs
  • $3/1M input is acceptable

Use GPT-4o when:

  • Under 128K tokens is fine
  • You need images in context
  • OpenAI integrations matter
  • You want variety in outputs (Claude can be... samey)
  • $2.50/1M input works better for your budget

Use Gemini when:

  • You genuinely need 200K+ tokens (be honest with yourself)
  • Video or audio is involved
  • Middle-context accuracy is critical
  • You're already in GCP
  • You want cheap pricing under 128K

Use open-source when:

  • Data cannot leave your servers (legal requirement, not preference)
  • 32K tokens is genuinely enough
  • You have the GPU budget
  • You plan to fine-tune
  • API costs at your scale are genuinely prohibitive

Bottom line

Context windows have gone from 4K to 1M in three years. That's incredible. But the marketing has gotten ahead of the reality.

For 90% of people, Claude's 200K or GPT-4o's 128K handles everything you'll actually do. Gemini's 1M is a flex for specific use cases - video processing, massive repo analysis, multi-year archives. If you're not doing those things, you're paying a latency tax for context you'll never use.

My actual workflow: Claude for code and complex analysis. GPT-4o for quick tasks and when I want a different perspective. Gemini for video stuff and when I genuinely need to process something massive.

Match the model to your actual needs. Not the theoretical maximum. Not the benchmark scores. Your actual, real, daily needs.

That 1M context window is cool. But you probably don't need it.

Free 24-hour trial

Try OpenClaw with Claude

200K context window for your personal AI. Deploy in 60 seconds on Molted.

Start free trial

24-hour free trial · No credit card required · Cancel anytime
