
I ported one program to 10 languages to see how an LLM thinks

I asked Claude to port a 500-line Go bank statement analyzer to 10 languages; 8 shipped, 2 produced nothing. Same logic, same tests, different stdlib. The goal wasn't the code — it was watching the LLM's internal process: where it hesitated, where it didn't, and whether it could predict its own performance.

It could.

The scoreboard

| Language | Lines | Deliberation | Fixups | Web searches | Confidence |
|---|---|---|---|---|---|
| Python + uv | 253 | 2% | 0 | 0 | 100% |
| Nim | 299 | 20% | 1 | 0 | 95% |
| F# | 353 | 15% | 0 | 0 | 95% |
| Crystal | 363 | 15% | 0 | 0 | 93% |
| Odin | 432 | 80% | 2 | 4 | 70% |
| Java 8 | 495 | 3% | 0 | 0 | 100% |
| JS (Node+JSDoc) | 524 | 5% | 0 | 0 | 100% |
| Go (zero-dep) | 540 | 5% | 0 | 0 | 100% |
| Pascal | 0 | — | — | — | fine |
| OCaml | 0 | — | — | — | ~85% |

“Deliberation” = fraction of thinking time spent worrying about the language rather than writing code.

The interesting finding: compound uncertainty

Each Odin stdlib call was ~75% confidence. Reasonable. But 15 such calls in a 432-line program:

0.75^15 ≈ 1.3% chance of all-correct

That's why Odin needed 4 web searches and 2 fixups while Nim (same program, similar language tier) needed zero searches and one trivial fixup. The bottleneck isn't any single unknown — it's the product of many small ones.

| Language | Confidence/call | Stdlib calls | P(all correct) | Outcome |
|---|---|---|---|---|
| Go | 100% | 15 | 100% | 0 fixups, 0 searches |
| Python | 100% | 12 | 100% | 0 fixups, 0 searches |
| Nim | 95% | 12 | 54% | 1 trivial fixup |
| Odin | 75% | 15 | 1.3% | 2 fixups, 4 searches |

The math predicted the outcomes almost exactly.

The predicted finding: confidence calibration works

Before each port, the LLM predicted its confidence, expected fixup count, and deliberation level. After writing, the predictions were compared against the actuals.

| Language | Predicted confidence | Actual | Predicted fixups | Actual | Predicted deliberation | Actual |
|---|---|---|---|---|---|---|
| Go | 100% | 100% | 0 | 0 | 5% | 5% |
| Python | 100% | 100% | 0 | 0 | 2% | 2% |
| Nim | 95% | 95% | 0–1 | 1 | ~20% | 20% |
| F# | 95% | 95% | 0–1 | 0 | 25–35% | 15% |
| Crystal | 93% | 93% | 0–1 | 0 | ~15% | 15% |
| Odin | 70% | 70% | 1–3 | 2 | ~80% | 80% |
| JS | 100% | 100% | 0 | 0 | ~5% | 5% |
| Java 8 | 100% | 100% | 0 | 0 | ~3% | 3% |

Confidence and fixup predictions: near-perfect. F# deliberation was the only miss — predicted 25–35%, actual 15%. Paradigm choices (pipe vs loop, Map vs Dictionary) turned out to be obvious. Deliberation is driven by API uncertainty, not conceptual difficulty.

Line count predictions, however, were consistently wrong:

| Language | Predicted lines | Actual | Error |
|---|---|---|---|
| Python | 150–200 | 253 | +27% |
| F# | 200–280 | 353 | +26% |
| JS | 260–310 | 524 | +69% |
| Java 8 | 650–800 | 495 | −31% |

The LLM knows what it doesn't know. It doesn't know how long things take to say.

What deliberation is actually about

The surprise: deliberation doesn't track language difficulty, stdlib gaps, or paradigm unfamiliarity. It tracks uncertainty about API names.

| Language | Stdlib gaps | API uncertainty | Deliberation |
|---|---|---|---|
| JS/Node | Many (no CSV, no HTML escape) | None — workarounds instantly known | 5% |
| Java 8 | None | None — one way to do everything | 3% |
| F# | Some (no CSV lib) | None — idiomatic choices obvious | 15% |
| Odin | Few | High — 9 “does this exist?” pauses | 80% |

JS has more stdlib gaps than Odin for this task. But a known gap costs lines, not thinking time. An unknown API costs thinking time regardless of whether it exists.

Odin has a CSV reader. The LLM wasn't sure of its exact name — that cost more deliberation than hand-rolling one in JS where it was certain no CSV module exists.

Java 8: the paradox

Java 8 was supposed to be the pain port. No var, no records, no text blocks. Instead it had the lowest deliberation of any language (3%).

Why: Java 8 is maximally constrained. One way to declare a list. One way to iterate. One way to aggregate. The design choice space is near-zero. The LLM doesn't think — it just types, at high speed, for a long time.

| Metric | Java 8 | Go | JS |
|---|---|---|---|
| Deliberation | 3% | 5% | 5% |
| Lines | 495 | 540 | 524 |
| Fixups | 0 | 0 | 0 |
| Design choices | ~0 | ~2 | ~5 |

The irony: Java 8 is the most annoying language for a human but the most effortless for the LLM. Verbosity is measured in tokens, and tokens are cheap. Uncertainty is measured in verification pauses, and Java 8 has none.

Two failures, same root cause

Neither Pascal nor OCaml failed because of language knowledge.

Pascal, attempt 1: The LLM read all existing ports (~1500 lines) before writing anything. Over-planned. Ran out of context. Zero lines of Pascal written.

Pascal, attempt 2: Instructed to just start typing. It did. Hit the output token limit before emitting a single line. Pascal's begin/end verbosity made it the one port where generation itself exceeds the budget.

The first attempt failed because the LLM wasted tokens on input. The second failed because Pascal wastes tokens on output.

OCaml: Confidence was ~85%. Never tested. The port died at opam init on Windows — Cygwin PATH issues consumed the entire session. Zero lines of OCaml written.

The “come back in 3 years and build it” test:

| Language | Revisit command |
|---|---|
| Go | go build |
| Python | uv run app.py |
| Nim | nimble build |
| F# | dotnet build |
| Odin | odin build . |
| JS | node app.js |
| Java 8 | javac *.java && java app.Main |
| OCaml | Install opam. Figure out Cygwin. Fix PATH. Hope opam init works. Install compiler switch. Install dune. Install deps. Maybe build. |

Toolchain accessibility is a language feature.

F#: where functional actually mattered

Most ports converge to the same shape regardless of syntax. F# was the exception. Its aggregation pipeline is structurally different:

Go (10 lines, imperative map-accumulate-sort):

totals := map[string]float64{}
for _, t := range transfers {
    totals[t.Category] += t.AmountF
}
// ... sort, collect into slice

F# (4 lines, pipeline):

transfers
|> List.groupBy (fun t -> t.Category)
|> List.map (fun (cat, ts) -> { Name = cat; Total = ts |> List.sumBy (fun t -> t.AmountF) })
|> List.sortByDescending (fun nt -> nt.Total)

But this only mattered for aggregation and the main processing flow. CSV parsing, file I/O, HTML generation — all converged to the same imperative shape in every language.

The takeaways

1. Training data density is everything. Go/Python = zero friction. Nim/Crystal = smooth. Odin = constant verification pauses. Language quality is irrelevant if the LLM can't recall the stdlib.

2. Compound uncertainty kills. 75% confidence per API call sounds fine. Across 15 calls it's 1.3%. Small unknowns multiply into mandatory verification loops.

3. The LLM knows what it doesn't know. Confidence and fixup predictions were near-perfect across 8 languages. Line count predictions were not. Calibration works for difficulty; it fails for effort.

4. Verbosity is free, uncertainty is expensive. Go (540 lines) was faster to write than Odin (432 lines). Java 8 (495 lines, 3% deliberation) was the most effortless port. Token cost is irrelevant next to verification cost.

5. Constraint helps LLMs. Languages with fewer ways to express the same thing (Go, Java) produce lower deliberation than languages with more choices (JS, Nim). The boilerplate is the documentation.

6. Setup is a language feature. Two languages with known-good LLM confidence (Pascal ~fine, OCaml ~85%) produced zero lines of code. Process killed them, not knowledge.

alexv.lv