Teaching a Wordle Bot to Play Like a Human
Stewart Labs has a habit of starting with a harmless question and then accidentally building a small machine around it. This one started with Wordle.
The obvious question was mathematical: can we build a good solver? That part has been studied deeply by people with more formal methods, more compute, and less interest in wandering off into side quests. The pure benchmark is essentially solved: the best published solvers average around 3.421 guesses over the standard 2,315-answer list.
So eventually the better question became: can we build something different? Not an academic optimal solver. A human-style assistant. Something that uses math, but also uses the things strong human players actually use: memory, judgment, streak protection, prior-answer history, recency, and the knowledge that Wordle now allows repeated answers.
That is what this project became: a local Python Wordle lab with two separate personalities. Pure Mode is the clean math benchmark. Human Mode is the real daily-play assistant.
The short version
The final Human Mode benchmark came in at a weighted average of 3.4164 guesses, with 0.00 weighted sixes and 0.00 weighted failures.
That is the fun headline: under the project’s stated real-world Human Mode benchmark, the solver slipped under the symbolic 3.421 target.
That does not mean it beat the pure mathematical optimum. It means something more specific and, frankly, more interesting: once we model the game the way humans now play it — with prior answers downweighted but not impossible — the assistant performs at an elite level.
Pure Mode: the math engine underneath
The project started with a traditional solver: pick a first word, score feedback, filter the answer list, and choose the next guess based on how well it splits the remaining candidates.
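The score-filter-split loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function names `feedback`, `filter_candidates`, and `bucket_sizes` are hypothetical, and the use of `G` for a green tile is an assumption (the patterns shown in this post only use `.` and `Y`).

```python
from collections import Counter


def feedback(guess: str, answer: str) -> str:
    """Return a pattern string: G = green (assumed symbol), Y = yellow, . = gray."""
    pattern = ["."] * len(guess)
    unmatched = Counter()  # answer letters not consumed by a green tile
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            pattern[i] = "G"
        else:
            unmatched[a] += 1
    # Second pass: a yellow is only awarded while spare copies of the letter remain,
    # which is what makes repeated letters score correctly.
    for i, g in enumerate(guess):
        if pattern[i] == "." and unmatched[g] > 0:
            pattern[i] = "Y"
            unmatched[g] -= 1
    return "".join(pattern)


def filter_candidates(candidates, guess, pattern):
    """Keep only the answers that would have produced this exact pattern."""
    return [word for word in candidates if feedback(guess, word) == pattern]


def bucket_sizes(candidates, guess):
    """Count how many remaining answers fall into each feedback pattern."""
    return Counter(feedback(guess, word) for word in candidates)


remaining = filter_candidates(["eking", "drown", "rebut", "cigar"], "slate", "....Y")
print(remaining)  # ['eking']
```

A guess that spreads the remaining answers over many small buckets leaves less work for the next turn, which is the intuition behind choosing by split quality.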
The current Pure Mode champion uses:
python -m src.wordle_lab \
--answers data/wordle_answers_full.txt \
--allowed data/wordle_allowed_guesses_full.txt \
--strategy second-map-bucket \
--first slate \
--second-guess-pool answers \
--worst-patterns 25
| Metric | Pure Mode result |
|---|---|
| Tested | 2,315 |
| Solved | 2,315 |
| Average guesses | 3.47 |
| Solved in 3 or fewer | 1,175 |
| Solved in 4 or fewer | 2,276 |
| Five-guess games | 39 |
| Six-guess games | 0 |
| Failed games | 0 |
| Risk score | 78 |
This is a strong heuristic solver. It is not globally optimal, but it is good enough to be dangerous. More importantly, it gives Human Mode a serious math engine to build on.
The important design choice was to stop obsessing over pure average. Chasing the academic optimum from 3.47 down toward 3.421 would mean building a global game-tree optimizer. That would be interesting, but it would not be the project we actually wanted.
The pivot: humans do not play the pure benchmark
The pure benchmark treats every possible answer as equally likely. That is fair for comparing solvers, but it is not how a serious daily Wordle player thinks.
A human player knows that some words have already appeared. A human player knows that a word used yesterday is much less attractive than a word that has never appeared. But the modern wrinkle is that prior answers are no longer impossible. Wordle can repeat answers now, so a smart player should not delete the past entirely. The past should become a weight, not a wall.
That became the central Human Mode idea:
prior answer ≠ impossible
prior answer = less likely, depending on recency
That one distinction changed the project. The solver stopped being a generic optimizer and started becoming a model of real play.
Building memory: the dated prior-answer file
To make Human Mode real, the solver needed historical memory. We collected a dated list of past Wordle answers and converted it into a local CSV file:
date,word
2026-05-28,divot
2026-05-27,stuff
2026-05-26,couch
...
2021-06-21,sissy
2021-06-20,rebut
2021-06-19,cigar
The final dated prior-answer file contained:
The repeated answers were not an annoyance. They were proof that the model had to be careful. If answers can repeat, then a solver that blindly excludes all prior answers is solving a game that no longer exists.
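One way to read a file in this `date,word` shape is sketched below. It is an assumption-laden sketch rather than the project's loader: `load_last_seen` is a hypothetical name, and keeping only the most recent date for a repeated word is a design choice made here because recency is what the weighting needs.

```python
import csv
import io
from datetime import date


def load_last_seen(csv_file) -> dict[str, date]:
    """Map each prior answer to the most recent date it appeared."""
    last_seen: dict[str, date] = {}
    for row in csv.DictReader(csv_file):
        word = row["word"].strip().lower()
        seen = date.fromisoformat(row["date"].strip())
        # A repeated answer appears on several rows; keep the latest date.
        if word not in last_seen or seen > last_seen[word]:
            last_seen[word] = seen
    return last_seen


# In-memory sample using rows quoted above; a real run would open the CSV file instead.
sample = io.StringIO("date,word\n2026-05-28,divot\n2026-05-27,stuff\n2021-06-19,cigar\n")
history = load_last_seen(sample)
print(history["cigar"])  # 2021-06-19
```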
Human Mode weighting
The first Human Mode used a simple recency schedule. Never-used answers had full weight. Prior answers were downweighted based on how recently they appeared.
After experimenting with several schedules, the winning custom schedule became intentionally aggressive:
| Answer history | Weight |
|---|---|
| Never used | 1.00 |
| Used within last 90 days | 0.005 |
| Used 91–365 days ago | 0.05 |
| Used 366–730 days ago | 0.15 |
| Used more than 730 days ago | 0.35 |
This is a strong opinion: recent prior answers are extremely unlikely, but not impossible. Old prior answers can come back, but they still do not look as attractive as fresh unused answers.
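The schedule in the table maps directly onto a small lookup function. The sketch below assumes the day ranges are inclusive at their upper ends (so exactly 90 days ago still counts as "within last 90 days"); the name `prior_weight` and its signature are illustrative, not taken from the project.

```python
from datetime import date
from typing import Optional


def prior_weight(last_seen: Optional[date], as_of: date) -> float:
    """Return the likelihood weight for an answer given when it last appeared."""
    if last_seen is None:
        return 1.0  # never used: full weight
    days_ago = (as_of - last_seen).days
    if days_ago <= 90:
        return 0.005  # very recent: extremely unlikely, but not impossible
    if days_ago <= 365:
        return 0.05
    if days_ago <= 730:
        return 0.15
    return 0.35  # old prior answers can plausibly come back


as_of = date(2026, 5, 28)
print(prior_weight(date(2021, 6, 19), as_of))  # 0.35
print(prior_weight(date(2026, 5, 27), as_of))  # 0.005
print(prior_weight(None, as_of))               # 1.0
```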
The weighted average is computed like this:
sum(prior_weight * guesses) / sum(prior_weight)
That creates a benchmark that is not pure Wordle math. It is a model of expected real-world daily play.
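As a small worked illustration of that formula, here is a sketch with made-up inputs. The function name `weighted_average` and the three sample games are hypothetical and are not project results.

```python
def weighted_average(results: list[tuple[float, int]]) -> float:
    """Compute sum(prior_weight * guesses) / sum(prior_weight).

    Each item is (prior_weight, guesses) for one simulated game.
    """
    total_weight = sum(weight for weight, _ in results)
    if total_weight == 0:
        raise ValueError("total weight must be positive")
    return sum(weight * guesses for weight, guesses in results) / total_weight


# Two never-used answers solved in 3 and 4, plus one recent answer solved in 5.
games = [(1.0, 3), (1.0, 4), (0.005, 5)]
print(round(weighted_average(games), 2))
```

With equal weights these three games would average 4.0. Under the weighting, the five-guess game on a recently used answer barely moves the result, which lands at about 3.50.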
The Human Mode champion
The current Human Mode champion uses the same core strategy as Pure Mode, but adds dated prior-answer weighting, human-specific overrides, and the custom aggressive schedule.
python -m src.wordle_lab \
--answers data/wordle_answers_full.txt \
--allowed data/wordle_allowed_guesses_full.txt \
--strategy second-map-bucket \
--first slate \
--second-guess-pool answers \
--prior-answers-dated data/prior_answers_dated.csv \
--prior-policy downweight \
--prior-weight-values 0.005,0.05,0.15,0.35 \
--as-of-date 2026-05-28 \
--show-weighted-score \
--weighted-worst-patterns 25
| Metric | Human Mode result |
|---|---|
| Total weight | 998.37 |
| Weighted average guesses | 3.4164 |
| Weighted solved in 3 or fewer | 555.96 |
| Weighted solved in 4 or fewer | 987.60 |
| Weighted five-guess games | 10.77 |
| Weighted six-guess games | 0.00 |
| Weighted failed games | 0.00 |
That number — 3.4164 — is the project’s little trophy.
Again, the honest interpretation matters: this is not a claim that the solver beat the pure mathematical optimum. It is a claim that the Human Mode model, with real-world prior-answer weighting, produced an elite result below the symbolic 3.421 target.
The human-specific overrides
One of the most interesting discoveries was that the best Pure Mode play is not always the best Human Mode play.
For example, after opening with slate, the feedback pattern ....Y (four grays, then a yellow) means that only the e is in the answer, and that it is not in the fifth position.
Pure Mode prefers:
slate ....Y -> rocky
Human Mode prefers:
slate ....Y -> drown
That is the whole project in miniature. The pure solver wants the mathematically best equal-answer split. The human solver wants the best weighted real-world path.
The current Human Mode overrides are:
| State after SLATE | Pure Mode choice | Human Mode choice |
|---|---|---|
| ....Y | rocky | drown |
| ..YY. | pouch | hound |
| ..Y.Y | march | began |
Some of these choices look strange if you only inspect the immediate next bucket. That is because the overrides were chosen by downstream weighted simulation, not just by the next split. A move can look slightly worse locally and still be better for the full branch.
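The override table is small enough to picture as a plain lookup keyed by opening word and feedback pattern. The sketch below is one possible shape, not the project's implementation: `HUMAN_OVERRIDES`, `PURE_CHOICES`, and `second_guess` are hypothetical names, and only the three states listed above are included.

```python
# Second guesses from the table above, keyed by (first guess, feedback pattern).
PURE_CHOICES = {
    ("slate", "....Y"): "rocky",
    ("slate", "..YY."): "pouch",
    ("slate", "..Y.Y"): "march",
}

HUMAN_OVERRIDES = {
    ("slate", "....Y"): "drown",
    ("slate", "..YY."): "hound",
    ("slate", "..Y.Y"): "began",
}


def second_guess(first: str, pattern: str, human_mode: bool):
    """Return the mapped second guess, or None when the state is not in the tables."""
    key = (first, pattern)
    if human_mode and key in HUMAN_OVERRIDES:
        return HUMAN_OVERRIDES[key]
    return PURE_CHOICES.get(key)


print(second_guess("slate", "....Y", human_mode=True))   # drown
print(second_guess("slate", "....Y", human_mode=False))  # rocky
```

Keeping the overrides in a separate table means Pure Mode stays untouched, and each override can be removed if a later simulation stops favoring it.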
Daily use: the assistant part
The project would be less interesting if it only produced benchmark tables. The useful version is the daily recommendation interface.
For Human Mode:
python -m src.wordle_lab \
--human-recommend slate ....Y \
--prior-weight-values 0.005,0.05,0.15,0.35
That returns:
Recommended next guess: drown
Recommendation type: probe
Explanation: Used Human Mode override for slate ....Y.
For Pure Mode:
python -m src.wordle_lab --pure-recommend slate ....Y
That returns:
Recommended next guess: rocky
And for a deeper Human Mode state:
python -m src.wordle_lab \
--human-recommend slate ....Y drown .Y... \
--prior-weight-values 0.005,0.05,0.15,0.35
The solver recommended furry as a probe, because the remaining branch had a lot of -er and -y pressure. That is exactly the kind of move a strong human might make: do not just guess one plausible answer too early; split the dangerous family first.
What the solver is really doing
Underneath the friendly command, the solver is doing several things at once:
- Filtering possible answers using exact Wordle feedback.
- Choosing guesses that minimize dangerous feedback buckets.
- Using prior-answer recency as a likelihood layer.
- Keeping prior answers possible, because repeats exist.
- Using Human Mode overrides where downstream simulation shows an advantage.
- Showing alternatives so the recommendation is explainable.
That last part matters. A solver that says “play this because I said so” is not very satisfying. A solver that says “play this because it keeps the worst bucket small, beats the alternatives downstream, and respects the prior-answer model” starts to feel like a coach.
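To make "keeps the worst bucket small" concrete, here is a simplified stand-in that picks the guess whose heaviest feedback bucket carries the least prior weight. It is not the project's second-map-bucket strategy and it ignores downstream simulation. The names `feedback`, `worst_bucket_weight`, and `best_probe` are hypothetical, as are the candidate weights, and `G` for green is an assumed symbol.

```python
from collections import Counter, defaultdict


def feedback(guess: str, answer: str) -> str:
    """Pattern string: G = green (assumed symbol), Y = yellow, . = gray."""
    pattern = ["."] * len(guess)
    unmatched = Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            pattern[i] = "G"
        else:
            unmatched[a] += 1
    for i, g in enumerate(guess):
        if pattern[i] == "." and unmatched[g] > 0:
            pattern[i] = "Y"
            unmatched[g] -= 1
    return "".join(pattern)


def worst_bucket_weight(guess: str, weights: dict[str, float]) -> float:
    """Total prior weight of the largest feedback bucket this guess can leave."""
    buckets = defaultdict(float)
    for answer, weight in weights.items():
        buckets[feedback(guess, answer)] += weight
    return max(buckets.values())


def best_probe(guesses: list[str], weights: dict[str, float]) -> str:
    """Pick the guess that minimizes the worst weighted bucket (first one wins ties)."""
    return min(guesses, key=lambda guess: worst_bucket_weight(guess, weights))


# A dangerous -ound family; 'mound' is treated here as a recent prior answer.
family = {"bound": 1.0, "found": 1.0, "hound": 1.0, "mound": 0.005}
print(best_probe(["found", "thumb"], family))  # thumb
```

Guessing found leaves three words indistinguishable if it misses, while thumb separates all four, which is the same "split the dangerous family first" instinct described earlier.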
The honest scoreboard
The project now has two champions, and they should not be confused.
| Mode | Purpose | Average | Sixes |
|---|---|---|---|
| Pure Mode | Fair equal-answer benchmark | 3.47 | 0 |
| Human Mode | Weighted real-world daily play | 3.4164 | 0.00 weighted |
Pure Mode is the honest math benchmark. Human Mode is the honest human-play benchmark.
The Human Mode result is allowed to beat 3.421 because it is not playing the same game as the pure optimizer. It is playing a weighted version of the real daily game, where answer history matters. That distinction is not a loophole; it is the point.
What we actually accomplished
We did not beat academia. We did something better suited to Stewart Labs: we built the bot we actually wanted.
It started as a pure Wordle solver. It became an experiment in how strong humans think. The final assistant is mathematical, but not sterile. It remembers. It downweights. It protects the streak. It explains itself. It knows that the best move is not always the word with the prettiest immediate split.
The project is also a good reminder that “optimal” depends on the question. If the question is “what is the best decision tree over a fixed answer list with equal probabilities?” then this solver is not the champion. If the question is “what should a serious daily Wordle player do, given answer history and the reality of repeats?” then this thing has become surprisingly formidable.
And yes, getting the Human Mode average to 3.4164 felt like a win. Not because it proves some grand theorem. Because it means the original hunch was right: a solver can be more than a calculator. It can have judgment.
Project notes
The project is local, simple to run, and intentionally free of web dependencies. No API keys. No network calls. No secrets. Just word lists, Python, and a growing pile of commands that probably made sense at the time.
The code lives in a local project folder and is versioned on GitHub. The normal workflow is still the satisfying little ritual:
git status
git add .
git commit -m "Describe what changed"
git push
The current Human Mode result is based on the dated prior-answer file through 2026-05-28 and the custom prior-weight schedule 0.005,0.05,0.15,0.35. Future Wordle behavior may change the best schedule, which is part of the fun.