
Teaching a Wordle Bot to Play Like a Human


“We are not beating the math. We are building the solver a very strong human would actually want to use.” — Jason

Stewart Labs has a habit of starting with a harmless question and then accidentally building a small machine around it. This one started with Wordle.

The obvious question was mathematical: can we build a good solver? That part has been studied deeply by people with more formal methods, more compute, and less interest in wandering off into side quests. The pure mathematical limit for Wordle is essentially settled: the best published decision trees average about 3.421 guesses over the standard 2,315-answer benchmark.

So eventually the better question became: can we build something different? Not an academic optimal solver. A human-style assistant. Something that uses math, but also uses the things strong human players actually use: memory, judgment, streak protection, prior-answer history, recency, and the knowledge that Wordle now allows repeated answers.

That is what this project became: a local Python Wordle lab with two separate personalities. Pure Mode is the clean math benchmark. Human Mode is the real daily-play assistant.

The short version

Metric	Value
Pure Mode average	3.47
Pure Mode sixes	0
Human Mode weighted average	3.4164
Human Mode weighted sixes	0.00
Dated prior answers	1,805
Repeated answers found	

The final Human Mode benchmark came in at a weighted average of 3.4164 guesses, with 0.00 weighted sixes and 0.00 weighted failures.

That is the fun headline: under the project’s stated real-world Human Mode benchmark, the solver slipped under the symbolic 3.421 target.

That does not mean it beat the pure mathematical optimum. It means something more specific and, frankly, more interesting: once we model the game the way humans now play it — with prior answers downweighted but not impossible — the assistant performs at an elite level.

Pure Mode: the math engine underneath

The project started with a traditional solver: pick a first word, score feedback, filter the answer list, and choose the next guess based on how well it splits the remaining candidates.
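The scoring and filtering steps are small enough to sketch. This is not the lab's actual code: the function names are made up for illustration, and the G for a green letter is an assumption (this write-up only ever shows . and Y). The one subtle part is repeated letters, which earn at most as many yellows as the answer has copies left over after the greens.

```python
from collections import Counter


def score_feedback(guess: str, answer: str) -> str:
    """Return a pattern such as '....Y'.

    'G' = right letter, right place (assumed symbol),
    'Y' = letter present elsewhere, '.' = letter absent.
    """
    pattern = ["."] * len(guess)
    unmatched = Counter()  # answer letters not consumed by a green

    # First pass: mark greens and collect the leftover answer letters.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            pattern[i] = "G"
        else:
            unmatched[a] += 1

    # Second pass: hand out yellows while leftover copies remain.
    for i, g in enumerate(guess):
        if pattern[i] == "." and unmatched[g] > 0:
            pattern[i] = "Y"
            unmatched[g] -= 1

    return "".join(pattern)


def filter_candidates(candidates, guess, pattern):
    """Keep only the answers that would have produced this exact pattern."""
    return [w for w in candidates if score_feedback(guess, w) == pattern]


# Toy candidate list, not the real answer file.
toy_candidates = ["fewer", "query", "rocky", "drown"]
print(filter_candidates(toy_candidates, "slate", "....Y"))
```

On the toy list, only the words that contain an e somewhere other than the fifth position, and none of s, l, a, or t, survive the filter.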

The current Pure Mode champion uses:

strategy: second-map-bucket
first guess: slate
second pool: answers
all 2,315 answers weighted equally

python -m src.wordle_lab \
  --answers data/wordle_answers_full.txt \
  --allowed data/wordle_allowed_guesses_full.txt \
  --strategy second-map-bucket \
  --first slate \
  --second-guess-pool answers \
  --worst-patterns 25

Metric	Pure Mode result
Tested	2,315
Solved	2,315
Average guesses	3.47
Solved in 3 or fewer	1,175
Solved in 4 or fewer	2,276
Five-guess games	39
Six-guess games	0
Failed games	0
Risk score	78

This is a strong heuristic solver. It is not globally optimal, but it is good enough to be dangerous. More importantly, it gives Human Mode a serious math engine to build on.

The important design choice was to stop obsessing over pure average. Chasing the academic optimum from 3.47 down toward 3.421 would mean building a global game-tree optimizer. That would be interesting, but it would not be the project we actually wanted.

The pivot: humans do not play the pure benchmark

The pure benchmark treats every possible answer as equally likely. That is fair for comparing solvers, but it is not how a serious daily Wordle player thinks.

A human player knows that some words have already appeared. A human player knows that a word used yesterday is much less attractive than a word that has never appeared. But the modern wrinkle is that prior answers are no longer impossible. Wordle can repeat answers now, so a smart player should not delete the past entirely. The past should become a weight, not a wall.

That became the central Human Mode idea:

prior answer ≠ impossible
prior answer = less likely, depending on recency

That one distinction changed the project. The solver stopped being a generic optimizer and started becoming a model of real play.

Building memory: the dated prior-answer file

To make Human Mode real, the solver needed historical memory. We collected a dated list of past Wordle answers and converted it into a local CSV file:

date,word
2026-05-28,divot
2026-05-27,stuff
2026-05-26,couch
...
2021-06-21,sissy
2021-06-20,rebut
2021-06-19,cigar

The final dated prior-answer file contained:

Metric	Value
Dated rows	1,805
Unique prior words	1,793
Repeated answers	
Oldest date	2021-06-19
Newest date	2026-05-28

The repeated answers were not an annoyance. They were proof that the model had to be careful. If answers can repeat, then a solver that blindly excludes all prior answers is solving a game that no longer exists.
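For readers who want to poke at a file like this, here is a minimal sketch of loading it and pulling out the same summary fields. The helper names are hypothetical, not functions from the lab, and the sample below uses only the six rows quoted above, so it finds no repeats.

```python
import csv
import io
from collections import Counter
from datetime import date


def load_dated_answers(lines):
    """Parse 'date,word' rows into (date, word) tuples."""
    reader = csv.DictReader(lines)
    return [
        (date.fromisoformat(row["date"]), row["word"].strip().lower())
        for row in reader
    ]


def summarize(rows):
    """Count rows, unique words, and words that appear more than once."""
    counts = Counter(word for _, word in rows)
    return {
        "dated_rows": len(rows),
        "unique_words": len(counts),
        "repeated_words": sorted(w for w, n in counts.items() if n > 1),
        "oldest": min(d for d, _ in rows),
        "newest": max(d for d, _ in rows),
    }


def last_seen(rows):
    """Map each word to its most recent appearance, for recency weighting."""
    latest = {}
    for day, word in rows:
        if word not in latest or day > latest[word]:
            latest[word] = day
    return latest


sample = io.StringIO(
    "date,word\n"
    "2026-05-28,divot\n"
    "2026-05-27,stuff\n"
    "2026-05-26,couch\n"
    "2021-06-21,sissy\n"
    "2021-06-20,rebut\n"
    "2021-06-19,cigar\n"
)
rows = load_dated_answers(sample)
print(summarize(rows))
```

To run it against the real file, swap the sample for open("data/prior_answers_dated.csv", newline="").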

Human Mode weighting

The first Human Mode used a simple recency schedule. Never-used answers had full weight. Prior answers were downweighted based on how recently they appeared.

After experimenting with several schedules, the winning custom schedule became intentionally aggressive:

Answer history	Weight
Never used	1.00
Used within last 90 days	0.005
Used 91–365 days ago	0.05
Used 366–730 days ago	0.15
Used more than 730 days ago	0.35

This is a strong opinion: recent prior answers are extremely unlikely, but not impossible. Old prior answers can come back, but they still do not look as attractive as fresh unused answers.
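The schedule translates into a few lines of Python. This is a sketch under stated assumptions rather than the lab's implementation: the function name is invented, "days ago" is measured from an explicit as-of date to the word's most recent appearance, and each band includes its upper boundary (day 90 still counts as "within last 90 days").

```python
from datetime import date, timedelta

# (upper bound in days, weight), taken from the schedule table.
RECENCY_BANDS = ((90, 0.005), (365, 0.05), (730, 0.15))
OLDER_WEIGHT = 0.35       # used more than 730 days ago
NEVER_USED_WEIGHT = 1.00  # never appeared as an answer


def prior_weight(last_used, as_of):
    """Return the likelihood weight for a word last used on `last_used`.

    Pass last_used=None for a word that has never been an answer.
    """
    if last_used is None:
        return NEVER_USED_WEIGHT
    days_ago = (as_of - last_used).days
    for max_days, weight in RECENCY_BANDS:
        if days_ago <= max_days:  # assumption: bands are inclusive
            return weight
    return OLDER_WEIGHT


as_of = date(2026, 5, 28)
print(prior_weight(None, as_of))                        # never used
print(prior_weight(as_of - timedelta(days=1), as_of))   # used yesterday
print(prior_weight(as_of - timedelta(days=400), as_of)) # used over a year ago
print(prior_weight(date(2021, 6, 19), as_of))           # oldest dated answer
```

Keeping the bands in one tuple makes it easy to try a different schedule without touching the lookup logic.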

The weighted average is computed like this:

sum(prior_weight * guesses) / sum(prior_weight)

That creates a benchmark that is not pure Wordle math. It is a model of expected real-world daily play.
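As a sanity check on the formula, here is a minimal sketch with a made-up two-word example. The function name and the sample numbers are illustrative only. With equal weights the result collapses to the ordinary mean, which is exactly what Pure Mode reports.

```python
def weighted_average(results):
    """Weighted mean of guess counts.

    `results` is a list of (prior_weight, guesses) pairs, one per answer.
    """
    total_weight = sum(weight for weight, _ in results)
    if total_weight == 0:
        raise ValueError("total weight must be positive")
    return sum(weight * guesses for weight, guesses in results) / total_weight


# Made-up example: a fresh word solved in 3, a recent repeat solved in 6.
toy_results = [(1.00, 3), (0.005, 6)]
print(round(weighted_average(toy_results), 4))

# Equal weights reduce to the plain average.
print(weighted_average([(1.0, 3), (1.0, 5)]))
```

The six-guess game on the recently used word barely moves the average, which is the whole effect of the aggressive schedule: hard branches only hurt the score in proportion to how likely they are.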

The Human Mode champion

The current Human Mode champion uses the same core strategy as Pure Mode, but adds dated prior-answer weighting, human-specific overrides, and the custom aggressive schedule.

python -m src.wordle_lab \
  --answers data/wordle_answers_full.txt \
  --allowed data/wordle_allowed_guesses_full.txt \
  --strategy second-map-bucket \
  --first slate \
  --second-guess-pool answers \
  --prior-answers-dated data/prior_answers_dated.csv \
  --prior-policy downweight \
  --prior-weight-values 0.005,0.05,0.15,0.35 \
  --as-of-date 2026-05-28 \
  --show-weighted-score \
  --weighted-worst-patterns 25

Metric	Human Mode result
Total weight	998.37
Weighted average guesses	3.4164
Weighted solved in 3 or fewer	555.96
Weighted solved in 4 or fewer	987.60
Weighted five-guess games	10.77
Weighted six-guess games	0.00
Weighted failed games	0.00

That number — 3.4164 — is the project’s little trophy.

Again, the honest interpretation matters: this is not a claim that the solver beat the pure mathematical optimum. It is a claim that the Human Mode model, with real-world prior-answer weighting, produced an elite result below the symbolic 3.421 target.

The human-specific overrides

One of the most interesting discoveries was that the best Pure Mode play is not always the best Human Mode play.

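For example, after opening with slate, the feedback pattern ....Y (a dot for each letter that is absent, a Y for a letter that is present but in the wrong position) means only the e is in the answer, and it is not in the fifth position.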

Pure Mode prefers:

slate ....Y -> rocky

Human Mode prefers:

slate ....Y -> drown

That is the whole project in miniature. The pure solver wants the mathematically best equal-answer split. The human solver wants the best weighted real-world path.

The current Human Mode overrides are:

State after SLATE	Pure Mode choice	Human Mode choice
`....Y`	`rocky`	`drown`
`..YY.`	`pouch`	`hound`
`..Y.Y`	`march`	`began`

Some of these choices look strange if you only inspect the immediate next bucket. That is because the overrides were chosen by downstream weighted simulation, not just by the next split. A move can look slightly worse locally and still be better for the full branch.
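Mechanically, an override is just a lookup that runs before the normal second-guess choice. The sketch below uses the three entries from the table; the dictionary and function names are hypothetical, and how the lab actually stores its overrides is not shown in this write-up.

```python
# (first guess, feedback pattern) -> Human Mode second guess, from the table.
HUMAN_OVERRIDES = {
    ("slate", "....Y"): "drown",
    ("slate", "..YY."): "hound",
    ("slate", "..Y.Y"): "began",
}


def choose_second_guess(first, pattern, pure_choice, human_mode):
    """Return the override in Human Mode if one exists, else the pure choice."""
    if human_mode:
        return HUMAN_OVERRIDES.get((first, pattern), pure_choice)
    return pure_choice


print(choose_second_guess("slate", "....Y", "rocky", human_mode=True))
print(choose_second_guess("slate", "....Y", "rocky", human_mode=False))
```

Any state without an entry falls through to whatever the pure engine would have played, so the overrides stay a thin layer on top of the math rather than a second solver.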

Daily use: the assistant part

The project would be less interesting if it only produced benchmark tables. The useful version is the daily recommendation interface.

For Human Mode:

python -m src.wordle_lab \
  --human-recommend slate ....Y \
  --prior-weight-values 0.005,0.05,0.15,0.35

That returns:

Recommended next guess: drown
Recommendation type: probe
Explanation: Used Human Mode override for slate ....Y.

For Pure Mode:

python -m src.wordle_lab --pure-recommend slate ....Y

That returns:

Recommended next guess: rocky

And for a deeper Human Mode state:

python -m src.wordle_lab \
  --human-recommend slate ....Y drown .Y... \
  --prior-weight-values 0.005,0.05,0.15,0.35

The solver recommended furry as a probe, a guess chosen mainly to split the remaining candidates, because many of the words left in that branch end in -er or -y. That is exactly the kind of move a strong human might make: do not just guess one plausible answer too early; split the dangerous family first.

What the solver is really doing

Underneath the friendly command, the solver is doing several things at once:

Filtering possible answers using exact Wordle feedback.
Choosing guesses that minimize dangerous feedback buckets.
Using prior-answer recency as a likelihood layer.
Keeping prior answers possible, because repeats exist.
Using Human Mode overrides where downstream simulation shows an advantage.
Showing alternatives so the recommendation is explainable.

That last part matters. A solver that says “play this because I said so” is not very satisfying. A solver that says “play this because it keeps the worst bucket small, beats the alternatives downstream, and respects the prior-answer model” starts to feel like a coach.
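To make the bucket idea concrete, here is a toy sketch of how a likelihood layer can change which guess looks safest. It assumes a "smallest worst bucket" rule with ties broken by bucket count; that rule, the function names, and the placeholder answers and patterns are all illustrative, and the lab's second-map-bucket strategy may score guesses differently.

```python
from collections import defaultdict


def bucket_weights(patterns, weights):
    """Sum the weight of the answers that land in each feedback pattern.

    `patterns` maps answer -> pattern for one guess. Answers missing from
    `weights` are treated as never used (weight 1.0).
    """
    buckets = defaultdict(float)
    for answer, pattern in patterns.items():
        buckets[pattern] += weights.get(answer, 1.0)
    return dict(buckets)


def pick_guess(pattern_table, weights):
    """Pick the guess whose heaviest bucket is lightest.

    `pattern_table` maps guess -> {answer: pattern}.
    Ties go to the guess with more buckets, then alphabetically.
    """
    def rank(guess):
        buckets = bucket_weights(pattern_table[guess], weights)
        return (max(buckets.values()), -len(buckets), guess)

    return min(pattern_table, key=rank)


# Placeholder data: three remaining answers, two candidate guesses.
toy_table = {
    "guess_x": {"answer_a": "p1", "answer_b": "p1", "answer_c": "p2"},
    "guess_y": {"answer_a": "p1", "answer_b": "p2", "answer_c": "p2"},
}
print(pick_guess(toy_table, {}))                   # equal weights
print(pick_guess(toy_table, {"answer_c": 0.005}))  # answer_c used recently
```

With equal weights the two guesses tie and the alphabetical tiebreak picks guess_x. Once answer_c is marked as a recent prior answer, guess_y wins because its heaviest bucket now carries far less real probability.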

The honest scoreboard

The project now has two champions, and they should not be confused.

Mode	Purpose	Average	Sixes
Pure Mode	Fair equal-answer benchmark	3.47	0
Human Mode	Weighted real-world daily play	3.4164	0.00 weighted

Pure Mode is the honest math benchmark. Human Mode is the honest human-play benchmark.

The Human Mode result is allowed to beat 3.421 because it is not playing the same game as the pure optimizer. It is playing a weighted version of the real daily game, where answer history matters. That distinction is not a loophole; it is the point.

What we actually accomplished

We did not beat academia. We did something better suited to Stewart Labs: we built the bot we actually wanted.

It started as a pure Wordle solver. It became an experiment in how strong humans think. The final assistant is mathematical, but not sterile. It remembers. It downweights. It protects the streak. It explains itself. It knows that the best move is not always the word with the prettiest immediate split.

The project is also a good reminder that “optimal” depends on the question. If the question is “what is the best decision tree over a fixed answer list with equal probabilities?” then this solver is not the champion. If the question is “what should a serious daily Wordle player do, given answer history and the reality of repeats?” then this thing has become surprisingly formidable.

And yes, getting the Human Mode average to 3.4164 felt like a win. Not because it proves some grand theorem. Because it means the original hunch was right: a solver can be more than a calculator. It can have judgment.

Project notes

The project is local, simple to run, and intentionally free of web dependencies. No API keys. No network calls. No secrets. Just word lists, Python, and a growing pile of commands that probably made sense at the time.

The code lives in a local project folder and is versioned on GitHub. The normal workflow is still the satisfying little ritual:

git status
git add .
git commit -m "Describe what changed"
git push

The current Human Mode result is based on the dated prior-answer file through 2026-05-28 and the custom prior-weight schedule 0.005,0.05,0.15,0.35. Future Wordle behavior may change the best schedule, which is part of the fun.
