
An LLM, a Formal Grammar, and 40 Sacred Words

07 Feb 2025

There's a game my friend Spencer introduced me to called "Person Do Thing". The basic premise is that one person is given a random word, which they have to describe to the other players. The catch is that they can only use the 40 basic words listed below. For example, if your word was "intuition", you might say "Person do thing fast and good, person no think". The other players make guesses at what the word might be, and you can continue to narrow it down by providing additional clues.

The allowed words of Person Do Thing

I thought it would be fun to get an LLM to play this game. Spencer and I spent the last few days getting this implemented, and it kind of works. More importantly though, we learned a lot about LLMs.

The approach is pretty simple. We'd use a formal grammar to constrain the LLM's output to match the rules of the game, and a system prompt to tell the LLM how to actually play.

An Informal Definition of Formal Grammar

Formal grammars are essentially a way to describe how to form strings that conform to a specific formal language. What this means in practice is that you can unambiguously define a language in such a way that a computer can decide whether or not a block of text is part of that language.

As a simple example, imagine your formal grammar is root ::= ("yes" | "no"). The only strings that are valid here would be "yes" or "no". Formal grammars can be made much more complicated than this by defining more than one rule. You can think of them roughly as Regex on steroids.
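
To make that concrete, here's a tiny Python sketch (not from the original project) showing that for a grammar this simple, deciding membership really is just a regex match:

import re

# The toy grammar root ::= ("yes" | "no") accepts exactly two strings,
# so membership checking is equivalent to a full regex match.
def in_toy_language(text: str) -> bool:
    return re.fullmatch(r"yes|no", text) is not None

print(in_toy_language("yes"))    # True
print(in_toy_language("maybe"))  # False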

Our grammar for our LLM is as follows (subject to change!). My input to this was minimal - Spencer did the brunt of the work here.

root ::= thinking newline (congrats | sentence (space sentence){0,4})
thinking ::= "<think>" ([^<>]){3500,10000} "</think>"
sentence ::= delimitedwords terminal
delimitedwords ::= word (space word | comma space word){0,100}
space ::= " "
comma ::= ","
newline ::= "\n"
terminal ::= "." | "?" | "!"
congrats ::= "YOU WIN!!!!!"
word ::= "Person" | "person" | "Place" | "place" | "Thing" | "thing" | "Do" | "do" | "Eat" | "eat" | "Feel" | "feel" | "Get" | "get" | "Go" | "go" | "Have" | "have" | "Like" | "like" | "Make" | "make" | "Say" | "say" | "Hear" | "hear" | "See" | "see" | "Think" | "think" | "Use" | "use" | "Want" | "want" | "Big" | "big" | "Far" | "far" | "Fast" | "fast" | "Good" | "good" | "Hot" | "hot" | "Many" | "many" | "Other" | "other" | "Real" | "real" | "Hard" | "hard" | "Same" | "same" | "After" | "after" | "Before" | "before" | "Again" | "again" | "And" | "and" | "But" | "but" | "In" | "in" | "On" | "on" | "With" | "with" | "More" | "more" | "Yes" | "yes" | "No" | "no" | "Can" | "can" | "Up" | "up"

This formal grammar basically says the following: the output must start with a thinking block wrapped in <think> tags, followed by a newline, and then either the string "YOU WIN!!!!!" or one to five sentences. Each sentence is a run of the allowed words (capitalized or lowercase), separated by spaces or commas, ending in ".", "?", or "!".

The reason for the <think> tags is for something called Chain of Thought, or CoT. CoT allows the LLM to think before it answers, and in practice this can massively improve the quality of the output. Especially with us constraining the output format as much as we are, CoT helps even more. The model we're using, distilled from DeepSeek R1, is trained to perform CoT inside of these <think> tags. Note that we force the thinking to be between 3500 and 10000 characters, as we found the quality was reduced when it thought for less than that.

Formal Grammars and LLMs

So now that we have a formal grammar, how can we use it with an LLM? One naive approach would be to just give it to the LLM and tell it "make your output conform to this". But LLMs are pretty bad at this, and there would be nothing enforcing that the LLM actually sticks to the grammar.

Instead, we can use the formal grammar to inform the token sampler. That is to say, when the LLM produces its probabilities for the next token, the token sampler will only choose a token that conforms to the formal grammar we specified. This means the LLM has to conform to the grammar no matter what.
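
Here's a rough, self-contained Python sketch of the idea; this is not llama.cpp's actual implementation, and the "model" and "grammar" are toy stand-ins just to show the masking step:

import math, random

VOCAB = ["yes", "no", "maybe", "</s>"]

def toy_logits(tokens_so_far):
    # Stand-in for the model's forward pass: the model really "wants" to say "maybe".
    return [0.5, 0.2, 3.0, 0.1]

def toy_grammar_accepts(tokens_so_far, token_id):
    # Stand-in for the grammar state machine: only "yes", "no", or end-of-sequence allowed.
    return VOCAB[token_id] in ("yes", "no", "</s>")

def sample_constrained(tokens_so_far):
    logits = toy_logits(tokens_so_far)
    # Keep only the tokens the grammar allows in its current state, then renormalize and sample.
    allowed = [t for t in range(len(VOCAB)) if toy_grammar_accepts(tokens_so_far, t)]
    weights = [math.exp(logits[t]) for t in allowed]
    return random.choices(allowed, weights=weights)[0]

print(VOCAB[sample_constrained([])])  # always "yes", "no", or "</s>" - never "maybe"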

As you can imagine, this property is extremely useful. Outside of this silly game, I've used this technique extensively to force LLMs to output e.g. valid JSON that has a precise structure that I expect.

Prompt Engineering and Murder

The formal grammar on its own is not a silver bullet. Ideally we want the LLM to choose tokens that almost always match the formal grammar, so that we rarely need to "steer" it. Without this, the LLM would want to say a forbidden word, and the sampler would have to choose a much less likely token instead. This massively reduced the quality of the output, and sometimes the LLM would go completely off the rails.

Our prompt is primarily a description of the game along with all of the allowed words. This helped a lot with getting the LLM to output high-quality hints. We did find it would ignore the prompt at times, but adding a "Do XYZ or else you will die" made it a lot less willing to deviate. I've had good success with similar techniques in the past, where threatening murder (or similarly bad things) really helps the LLM stay on course.
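
For flavor, here's a hypothetical Python sketch of how a prompt along those lines might be assembled. This is not our actual prompt wording, and the word list here is truncated:

# Hypothetical reconstruction of the kind of system prompt described above;
# the real prompt is worded differently, and ALLOWED_WORDS is truncated here.
ALLOWED_WORDS = ["person", "place", "thing", "do", "eat", "feel", "get", "go"]  # ...plus the rest of the 40

SYSTEM_PROMPT = (
    "We are playing the game 'Person Do Thing'. Describe the secret word to the "
    "player using ONLY these words: " + ", ".join(ALLOWED_WORDS) + ". "
    "Never say the secret word itself. If the player guesses it, say 'YOU WIN!!!!!'. "
    "Break these rules and you will die."
)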

Infrastructure and Front-end

For doing silly LLM projects I normally like to use Ollama, because it exposes an HTTP API. I like this because it means I can run Ollama on my GPU server and have other projects connect to it. Unfortunately, Ollama's formal grammar support is extremely lackluster and is only good for structured JSON. I'm not sure why they chose to do this - even for practical things (i.e. not this game), there are other formats that require fewer tokens (such as XML) and thus use less processing power.

Anyway, Ollama is out, so I chose to just go with Llama.cpp. This is a powerful inference engine which supports a ton of features. Most importantly, its token sampler can follow a custom formal grammar, which is what we need.
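
As an illustration of what that looks like in practice, here's a minimal sketch using the llama-cpp-python bindings. This is not what we actually shipped (as explained below, we wrapped the CLI instead), and the model path, grammar file name, and prompt are placeholders:

from llama_cpp import Llama, LlamaGrammar

# Load our GBNF grammar and a local GGUF model (both paths are placeholders).
grammar = LlamaGrammar.from_string(open("person_do_thing.gbnf").read())
llm = Llama(model_path="deepseek-r1-distill.gguf")

output = llm(
    "System prompt and game state go here...",
    grammar=grammar,   # the sampler will only emit tokens the grammar allows
    max_tokens=2048,
)
print(output["choices"][0]["text"])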

We can't expose Llama.cpp directly to the end-user, though, because it would give away the answer. The LLM's thinking must be hidden away. When play-testing, Spencer would hold a piece of paper over my screen, which obviously would need to be handled more gracefully. The fancy approach would be to use actual bindings to interface with Llama.cpp. Instead, I went with the hacky approach of writing a basic front-end around Llama.cpp's command-line interface. This front-end chooses a random word, inserts it into the prompt, and obscures the thinking of the LLM from the user.
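
Here's a simplified sketch of that front-end idea, not our exact code; the word list, file names, and CLI flags are illustrative and may differ between llama.cpp versions:

import random
import re
import subprocess

SECRET_WORDS = ["wolf", "cat", "intuition", "truck"]  # placeholder word list

def get_hint(secret: str, guess_history: str) -> str:
    # Build the prompt around the secret word and run the llama.cpp CLI with our grammar.
    prompt = f"The secret word is '{secret}'.\n{guess_history}"
    result = subprocess.run(
        ["./llama-cli", "-m", "model.gguf", "--grammar-file", "pdt.gbnf", "-p", prompt],
        capture_output=True, text=True,
    )
    # Strip the chain-of-thought so the player never sees the secret word leak.
    return re.sub(r"<think>.*?</think>", "", result.stdout, flags=re.DOTALL).strip()

secret_word = random.choice(SECRET_WORDS)
print(get_hint(secret_word, "Player: is it a truck?"))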

We rented a cheap GPU server for actually demoing the project to people. My GPU server is alright, but for under a dollar an hour we can rent some much more powerful hardware. Our friend Nolen pointed out that speed is really important for engagement, and he's absolutely right. Taking a minute for the computer to think in between guesses really ruined the flow of the game, especially when compared to the real-time feedback you get when playing with other people.

Earlier Attempts

While getting our prompt and grammar fine-tuned, we ran into all sorts of hilarity from the LLM. Especially in the thinking block, the computer said all sorts of funny (and kind of cute!) things.

Cats are real and have fur

A cat is a big real fur creature that can move fast when it's running. It is big while sleeping on soft ground. The cat can move fast when it's moving. It is big when it's on big ground because the ground is soft. The cat is real because it is a living creature. It can move fast when it's moving. It is big when it's on big ground because the ground is soft. The cat can move fast when it's moving. It is real because it is a living creature. It is big when it's on big ground because the ground is soft. The cat can move fast when it is moving. It is big when it is on big ground because the ground is soft.

Trucks are not wolves?

Alright, the user guessed "truck" in response to my description of "wolf". I need to respond in a way that's encouraging but clear that it's not correct.

First, I should acknowledge the guess positively. "Truck" is a thing, so that's on the right track. I should say "yes" to acknowledge the correct category but then clarify that it's not the right word.

Now, I need to provide more clues without giving it away. Since the target is "wolf", I can mention that the thing has fur, which differentiates it from a truck. Trucks have wheels and engines, while wolves have fur and live in the wild.

I should also mention that the thing moves on its own, which is true for both wolves and trucks, but adding that it moves in groups (packs) can help narrow it down. Additionally, I can hint at the sound it makes, like howling, which is another unique characteristic of wolves.

Putting this all together, my response should reinforce that it's a thing, provide new information about fur, movement, and sound, without directly saying "wolf". This way, the user gets more clues to narrow down the correct answer.

</think>

Yes, it is a thing, but it is not a truck. The thing has fur. It moves on its own. Many of them go together. They make a sound.

No more thinking

This was a result of our grammar being insufficiently constrained and our system prompt not being refined or specific enough.

> is it a truck

<think>

Alright, so the user guessed "truck" when I described "cat." Let me break down how to respond appropriately.

First, I need to acknowledge that "truck" isn't correct but provide feedback that's encouraging. Since "truck" is a thing, the category is right, so I should mention that.

Next, I should clarify that while a truck is a thing, it's not the same as the target word. I can add that the thing I'm describing is smaller. Maybe something like, "Yes, it is a thing, but no, it is not a truck."

Then, to give a hint, I can mention another characteristic of a cat. For example, it has fur. So, I can say, "It has fur."

Putting it all together, the response should be positive, confirm the correct category, explain why "truck" isn't the answer, and provide another clue to help the user get closer to the right word.

</think>

Yes, but no. Yes, person have thing. Yes, thing go place. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink. No, nothink.

Closing Thoughts

We learned a lot while building this, and the LLM is a lot better at playing "Person Do Thing" than when we started, but it's still pretty terrible. It's a fun proof of concept but humans are way better at the game. I think a big part of this is that LLMs are currently quite bad at theory of mind, which is important for playing "Person Do Thing". The hints you give depend heavily on what the other players are thinking, and this is where the LLM really struggles.

Source

If you want to see our (pretty hacky) source code, it's on my Git server, as usual!