# I ported one program to 10 languages to see how an LLM thinks
I asked Claude to port a 500-line Go bank statement analyzer to 10 languages; 8 ports succeeded and 2 (Pascal, OCaml) produced zero lines. Same logic, same tests, different stdlibs. The goal wasn't the code — it was watching the LLM's internal process: where it hesitated, where it didn't, and whether it could predict its own performance.
It could.
## The scoreboard
| Language | Lines | Deliberation | Fixups | Web searches | Confidence |
|---|---|---|---|---|---|
| Python + uv | 253 | 2% | 0 | 0 | 100% |
| Nim | 299 | 20% | 1 | 0 | 95% |
| F# | 353 | 15% | 0 | 0 | 95% |
| Crystal | 363 | 15% | 0 | 0 | 93% |
| Odin | 432 | 80% | 2 | 4 | 70% |
| Java 8 | 495 | 3% | 0 | 0 | 100% |
| JS (Node+JSDoc) | 524 | 5% | 0 | 0 | 100% |
| Go (zero-dep) | 540 | 5% | 0 | 0 | 100% |
| Pascal | — | — | — | — | fine |
| OCaml | — | — | — | — | ~85% |
“Deliberation” = fraction of thinking time spent worrying about the language rather than writing code.
## The interesting finding: compound uncertainty
Each Odin stdlib call carried ~75% confidence. Reasonable. But compound that across 15 such calls in a 432-line program:

0.75^15 ≈ 1.3% chance of all-correct

That's why Odin needed 4 web searches and 2 fixups while Nim (same program, similar language tier) needed zero searches and only one trivial fixup. The bottleneck isn't any single unknown — it's the product of many small ones.
| Language | Confidence/call | Stdlib calls | P(all correct) | Outcome |
|---|---|---|---|---|
| Go | 100% | 15 | 100% | 0 fixups, 0 searches |
| Python | 100% | 12 | 100% | 0 fixups, 0 searches |
| Nim | 95% | 12 | 54% | 1 trivial fixup |
| Odin | 75% | 15 | 1.3% | 2 fixups, 4 searches |
The math predicted the outcomes almost exactly.
## The predicted finding: confidence calibration works
Before each port, the LLM predicted its confidence, expected fixup count, and deliberation level. After writing, those predictions were compared against the actuals.
| Language | Predicted confidence | Actual | Predicted fixups | Actual | Predicted deliberation | Actual |
|---|---|---|---|---|---|---|
| Go | 100% | 100% | 0 | 0 | 5% | 5% |
| Python | 100% | 100% | 0 | 0 | 2% | 2% |
| Nim | 95% | 95% | 0–1 | 1 | ~20% | 20% |
| F# | 95% | 95% | 0–1 | 0 | 25–35% | 15% |
| Crystal | 93% | 93% | 0–1 | 0 | ~15% | 15% |
| Odin | 70% | 70% | 1–3 | 2 | ~80% | 80% |
| JS | 100% | 100% | 0 | 0 | ~5% | 5% |
| Java 8 | 100% | 100% | 0 | 0 | ~3% | 3% |
Confidence and fixup predictions: near-perfect. F# deliberation was the only miss — predicted 25–35%, actual 15%. Paradigm choices (pipe vs loop, Map vs Dictionary) turned out to be obvious. Deliberation is driven by API uncertainty, not conceptual difficulty.
Line count predictions, however, were consistently wrong:
| Language | Predicted lines | Actual | Error (vs nearest bound) |
|---|---|---|---|
| Python | 150–200 | 253 | +27% |
| F# | 200–280 | 353 | +26% |
| JS | 260–310 | 524 | +69% |
| Java 8 | 650–800 | 495 | −24% |
The LLM knows what it doesn't know. It doesn't know how long things take to say.
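The article doesn't spell out how the error figures are computed; signed error relative to the nearest bound of the predicted range reproduces the Python, F#, and JS numbers. A sketch under that assumption (`rangeErrorPct` is a name I made up):

```go
package main

import (
	"fmt"
	"math"
)

// rangeErrorPct returns the signed percent error of actual relative to
// the nearest bound of the predicted [lo, hi] range, rounded to the
// nearest whole percent; 0 if actual falls inside the range.
func rangeErrorPct(lo, hi, actual float64) int {
	var e float64
	switch {
	case actual > hi:
		e = 100 * (actual - hi) / hi
	case actual < lo:
		e = 100 * (actual - lo) / lo
	}
	return int(math.Round(e))
}

func main() {
	fmt.Printf("Python: %+d%%\n", rangeErrorPct(150, 200, 253)) // +27%
	fmt.Printf("F#:     %+d%%\n", rangeErrorPct(200, 280, 353)) // +26%
	fmt.Printf("JS:     %+d%%\n", rangeErrorPct(260, 310, 524)) // +69%
}
```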
## What deliberation is actually about
The surprise: deliberation doesn't track language difficulty, stdlib gaps, or paradigm unfamiliarity. It tracks uncertainty about API names.
| Language | Stdlib gaps | API uncertainty | Deliberation |
|---|---|---|---|
| JS/Node | Many (no CSV, no HTML escape) | None — workarounds instantly known | 5% |
| Java 8 | None | None — one way to do everything | 3% |
| F# | Some (no CSV lib) | None — idiomatic choices obvious | 15% |
| Odin | Few | High — 9 “does this exist?” pauses | 80% |
JS has more stdlib gaps than Odin for this task. But a known gap costs lines, not thinking time. An unknown API costs thinking time regardless of whether it exists.
Odin has a CSV reader. The LLM wasn't sure of its exact name — that cost more deliberation than hand-rolling one in JS where it was certain no CSV module exists.
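To make "a known gap costs lines, not thinking time" concrete, here is the kind of minimal CSV field splitter that can be hand-rolled from memory when you're certain no stdlib module exists. It's sketched in Go for consistency with the other snippets (Go itself ships `encoding/csv`, so this mirrors the Node situation, not the Go port):

```go
package main

import (
	"fmt"
	"strings"
)

// splitCSVLine splits one CSV record into fields, handling quoted
// fields and doubled quotes ("") inside them. Minimal by design: no
// embedded newlines, no error reporting; just the shape of a
// from-memory workaround, not a full parser.
func splitCSVLine(line string) []string {
	var fields []string
	var cur strings.Builder
	inQuotes := false
	for i := 0; i < len(line); i++ {
		c := line[i]
		switch {
		case c == '"' && inQuotes && i+1 < len(line) && line[i+1] == '"':
			cur.WriteByte('"') // doubled quote inside a quoted field
			i++
		case c == '"':
			inQuotes = !inQuotes
		case c == ',' && !inQuotes:
			fields = append(fields, cur.String())
			cur.Reset()
		default:
			cur.WriteByte(c)
		}
	}
	return append(fields, cur.String())
}

func main() {
	fmt.Printf("%q\n", splitCSVLine(`2024-01-05,"Groceries, Inc.",-42.17`))
	// ["2024-01-05" "Groceries, Inc." "-42.17"]
}
```

Roughly 25 lines, zero verification pauses: every construct used is one the model is certain about, which is exactly the trade the article describes.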
## Java 8: the paradox
Java 8 was supposed to be the pain port. No `var`, no records, no text blocks.
Instead it had the lowest deliberation of any language (3%).
Why: Java 8 is maximally constrained. One way to declare a list. One way to iterate. One way to aggregate. The design choice space is near-zero. The LLM doesn't think — it just types, at high speed, for a long time.
| Metric | Java 8 | Go | JS |
|---|---|---|---|
| Deliberation | 3% | 5% | 5% |
| Lines | 495 | 540 | 524 |
| Fixups | 0 | 0 | 0 |
| Design choices | ~0 | ~2 | ~5 |
The irony: Java 8 is the most annoying language for a human but the most effortless for the LLM. Verbosity is measured in tokens, and tokens are cheap. Uncertainty is measured in verification pauses, and Java 8 has none.
## Two failures, same root cause
Neither Pascal nor OCaml failed because of language knowledge.
Pascal, attempt 1: The LLM read all existing ports (~1500 lines) before writing anything. Over-planned. Ran out of context. Zero lines of Pascal written.
Pascal, attempt 2: Instructed to just start typing. It did. Hit the output token limit before emitting a single line. Pascal's `begin`/`end` verbosity made it the one port where generation itself exceeds the budget.
The first attempt failed because the LLM wasted tokens on input. The second failed because Pascal wastes tokens on output.
OCaml: Confidence was ~85%. Never tested. The port died at opam init on Windows — Cygwin PATH issues consumed the entire session. Zero lines of OCaml written.
The “come back in 3 years and build it” test:
| Language | Revisit command |
|---|---|
| Go | `go build` |
| Python | `uv run app.py` |
| Nim | `nimble build` |
| F# | `dotnet build` |
| Odin | `odin build .` |
| JS | `node app.js` |
| Java 8 | `javac *.java && java app.Main` |
| OCaml | Install opam. Figure out Cygwin. Fix PATH. Hope `opam init` works. Install compiler switch. Install dune. Install deps. Maybe build. |
Toolchain accessibility is a language feature.
## F#: where functional actually mattered
Most ports converge to the same shape regardless of syntax. F# was the exception. Its aggregation pipeline is structurally different:
Go (10 lines, imperative map-accumulate-sort):
```go
totals := map[string]float64{}
for _, t := range transfers {
	totals[t.Category] += t.AmountF
}
// ... sort, collect into slice
```
F# (4 lines, pipeline):
```fsharp
transfers
|> List.groupBy (fun t -> t.Category)
|> List.map (fun (cat, ts) -> { Name = cat; Total = ts |> List.sumBy (fun t -> t.AmountF) })
|> List.sortByDescending (fun nt -> nt.Total)
```
But this only mattered for aggregation and the main processing flow. CSV parsing, file I/O, HTML generation — all converged to the same imperative shape in every language.
## The takeaways
1. Training data density is everything. Go/Python = zero friction. Nim/Crystal = smooth. Odin = constant verification pauses. Language quality is irrelevant if the LLM can't recall the stdlib.
2. Compound uncertainty kills. 75% confidence per API call sounds fine. Across 15 calls it's 1.3%. Small unknowns multiply into mandatory verification loops.
3. The LLM knows what it doesn't know. Confidence and fixup predictions were near-perfect across 8 languages. Line count predictions were not. Calibration works for difficulty; it fails for effort.
4. Verbosity is free, uncertainty is expensive. Go (540 lines) was faster to write than Odin (432 lines). Java 8 (495 lines, 3% deliberation) was the most effortless port. Token cost is irrelevant next to verification cost.
5. Constraint helps LLMs. Languages with fewer ways to express the same thing (Go, Java) produce lower deliberation than languages with more choices (JS, Nim). The boilerplate is the documentation.
6. Setup is a language feature. Two languages with known-good LLM confidence (Pascal ~fine, OCaml ~85%) produced zero lines of code. Process killed them, not knowledge.