Making Gemini play the GOHR

Updated 2025-09-13 for Game Server ver 8.030 and gemini-2.5-flash

Architecture

As of July 31, 2025, the GOHR Gemini harness supports two modes: the "play" mode and the "prepared episodes" mode.

In the play mode, the Gemini bot makes moves one by one, receiving the feedback from our side. Occasionally, we ask the bot to describe its mental model of the rules; or sometimes it volunteers to do so.

In the prepared-episodes mode, we create a number of completed episodes (e.g. with random moves, eventually clearing each board), and send their transcripts to Gemini asking the bot to provide its idea of the rules, and based on that idea, tell us how it would clear several new ("future") boards shown to it (by sending to us a sequence of proposed moves). We then analyze Gemini's proposed sequences of moves to see if they would indeed clear the board.
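The analysis step mentioned above (checking whether a proposed sequence of moves would indeed clear a board) can be sketched as follows. The names and the toy rule representation are hypothetical; this is not the real harness code.

```python
# Replay Gemini's proposed move sequence against the hidden rule and check
# that every move would be accepted and the board ends up empty.

def clears_board(objects, moves, rule):
    """objects: dict id -> attributes; moves: list of (id, bucket) pairs;
    rule(attributes, bucket) -> True iff the server would accept the move."""
    board = dict(objects)            # don't mutate the caller's board
    for obj_id, bucket in moves:
        if obj_id not in board or not rule(board[obj_id], bucket):
            return False             # a denied or invalid move
        del board[obj_id]            # accepted: the piece leaves the board
    return not board                 # cleared only if nothing remains
```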

While the setup used in the prepared-episodes mode is obvious (one request, one response), the arrangements used in the play mode require a more detailed description. The setup that we use in the play mode to get Google Gemini to play the Rule Game (aka the GOHR) is, in a way, the inverse of that of the Captive Game Server (CGS). With the CGS, control resides in the ML (AI) program that wants to solve a GOHR problem: whenever the ML program makes a move, it sends its proposed move as a request to the CGS, and the CGS applies the move to the current board and returns its decision (accept/deny) in the response. With our Gemini app, control resides in our app: the app keeps sending requests to the Gemini server at Google, asking it to make its next move. Gemini's proposed move (hopefully) arrives in the response; our app then applies it to the board and prepares the next request for a move.
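The inverted control flow can be summarized in a few lines of Python; all names here are illustrative, not the real harness code.

```python
# Play-mode control loop: our app owns the loop, asks the bot for a move,
# applies the move to the board itself, and feeds the verdict back.

ACCEPT, DENY = 0, 4   # result codes, as they appear in the transcripts below

def play(board, ask_bot, rule, max_moves=100):
    """board: dict id -> attributes; ask_bot(board, transcript) -> (id, bucket);
    rule(attributes, bucket) -> bool. Returns the transcript of (move, code)."""
    transcript = []
    while board and len(transcript) < max_moves:
        obj_id, bucket = ask_bot(board, transcript)
        if obj_id in board and rule(board[obj_id], bucket):
            code = ACCEPT
            del board[obj_id]        # the piece leaves the board
        else:
            code = DENY              # the piece stays put
        transcript.append(((obj_id, bucket), code))
    return transcript
```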

Gemini models

As of early 2025, Gemini offers a number of "models". We started in March 2025 with the then-current model, gemini-2.0-flash, which produced very disappointing results.

Later we switched to a more recent model, gemini-2.5-flash-preview-05-20, which Google bills as a model with thinking support. It produces much better results, although it works much more slowly.

Our harness tool can be configured to use other models as well (with the command-line option model=...).

Conversation structure in the play mode

A Gemini user could easily imagine that when a person or an app interacts with the Gemini server, the server somehow keeps track of the entire content of the "conversation", perhaps by creating some kind of "model" of the knowledge conveyed to it by the user's input, in a suitable internal representation, and expanding and updating that model as more user input comes through.

Somewhat unexpectedly, Gemini documentation (see e.g. Home | Gemini API | Models | API Reference | Generating content) implies that nothing like this takes place! The Gemini server is essentially stateless: when a person or an app interacts with it, the server does not keep track of the user's previous requests, and does not store any "model" of that knowledge.

Instead, if you desire to have a "conversation" with the Gemini bot, you need to supply the entire record of the conversation (what you said -- what the bot said -- what you said -- ... -- what you say now) in each of your HTTP requests! The JSON structure that one sends to the server in the request provides support for "packing" one's conversation into it.
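Concretely, a request to the generateContent endpoint carries the whole history in its contents array, with the roles alternating between user and model and the final user turn holding the new prompt. The helper below is a sketch; the field names follow the public REST API, but the surrounding harness code is hypothetical.

```python
# Pack a multi-turn "conversation" into a single generateContent request body.

def pack_conversation(system_text, turns, new_prompt):
    """turns: list of (who, text) pairs, who in {"user", "model"}."""
    contents = [{"role": who, "parts": [{"text": text}]} for who, text in turns]
    contents.append({"role": "user", "parts": [{"text": new_prompt}]})
    return {
        "system_instruction": {"parts": [{"text": system_text}]},
        "contents": contents,
    }
```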

The GOHR, of course, is built on a dialogue, with the player and the Game Server typically exchanging a request/response pair (in whichever direction) per move. It is possible to literally pack one's dialogue into the Gemini request structure, using the framework provided. But since we are already quite inefficient, the decision was made to normally pack the entire transcript of a series (all episodes played in a single run of our app, on the same rule set) into a single prompt. Certain limited exceptions exist, though, and are discussed in appropriate places below.

Asking Gemini about this

If asked, Gemini will tell all of this to you itself: session transcript

If you want to spare yourself reading our long and verbose conversation, here are a few salient points:

1) If you communicate with Gemini using the plain HTTP API, then yes, you have to explicitly include the same system instructions in each API call. If you use the Python API or Node.js API, then, on the surface level, it gives you the concept of a session, so you put your system instructions into the higher-level API only once. But what the higher-level API does behind the scenes is simply "prepending this instruction to the conversation history for you in subsequent sendMessage() calls." So in reality the system instructions are still sent over the wire on every API call, exactly the same way as with the plain HTTP API.

2) If I use the cached content API, then of course I send the introductory text (the meta-rules, or content of previous episodes) over the wire only once. But internally, "in terms of the [Gemini] model's core understanding and generation, the behavior should be largely identical" to what happens if no caching has been used.

3) I asked how the cached content is actually handled: is it stored simply as plain text to be parsed again and again, as some kind of parsed-text structure, or in a more sophisticated way? The answer was, quite a bit of pre-processing is done, so that the cached text's tokens "are then converted into numerical vector embeddings. These embeddings are high-dimensional mathematical representations that capture the semantic meaning of the tokens and their relationships within the context. This is the crucial step where the "text" is transformed into something the neural network can directly work with." Additionally, something called the Key-Value (KV) cache for Gemini's "attention mechanism" is filled.

So this is quite a bit of data-processing work, and caching helps reduce the overall amount of work in a long session. But again, it appears that caching simply eliminates the duplication of some work; the results of that work would be roughly the same with and without caching.

Using the app

The main script is in /home/vmenkov/w2020/game/scripts/gemini.sh. The usage is very similar to that of captive-full.sh; all the same arguments and options describing the rule set and the board generation are accepted.

Options

There are a number of additional options you can put on the command line, in the key=value format, as follows:

Observations - Gemini 2.0

This section was written when working with Gemini 2.0 Flash; unless otherwise indicated, everything here applies only to Gemini 2.0. The problems described here don't seem to appear to the same extent in Gemini 2.5 Flash, so this section is mostly irrelevant now.

When dealing with a system whose main strength is in generating natural language, one has to be ready to handle sensibly a great variety of situations. We don't intend, after all, to teach our own app to understand natural language (such as Gemini produces); instead we put the bulk of our effort into getting Gemini to produce output in a formalized, structured way that is easy for our app to understand.

As mentioned above, the instructions I provide to Gemini emphasize that what I want to hear from it are descriptions of proposed moves in a standard format, MOVE objectId bucketId. On simpler problems, such as color matching, this is more or less what happens; although every once in a while Gemini decides to provide a long disquisition on its motivation and reasoning, its long message still includes the "MOVE ..." substring which our app extracts and uses as the next move. However, in more complicated problems (e.g. "counterclockwise") one can sometimes see Gemini slide into purely verbal descriptions of the move.
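A minimal extractor in this spirit is shown below (the real harness's parsing may differ; taking the last match is a heuristic for replies that recap earlier moves before stating the new one):

```python
import re

# Pull the proposed move out of a possibly long, verbose reply.
# We take the last "MOVE objectId bucketId" occurrence, since the bot
# often recaps earlier moves before stating the new one.
MOVE_RE = re.compile(r"\bMOVE\s+(\d+)\s+(\d+)\b")

def extract_move(reply_text):
    """Return (objectId, bucketId) from the last MOVE substring, or None."""
    matches = MOVE_RE.findall(reply_text)
    if not matches:
        return None
    obj, bucket = matches[-1]
    return int(obj), int(bucket)
```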

.... use 1 retry with an extra prompt ....

.... move already removed piece ... ... artifact of my reading -- he just talks about reasoning first, mentioning old moves, and then gives the new move at the end! ...

... "IMMOVABLE" means that the object you have tried to move cannot at present be moved to any bucket, and the object stays on the board. This mean that you should not try to attempt to move this object again until the board has changed in some way (i.e. some other object has been successfully removed).

Failing to realize that a piece has been removed already

This is rare, but has been seen on occasion with Gemini model 2.0. It is reported as code -10 (invalid object ID) on our side.

Seemingly unprovoked. E.g., in gemini-sm_qcts-03.txt:

    During episode 5, you have made so far the following 15 move attempts, with the following results:
MOVE 8 3 ACCEPT
MOVE 6 3 DENY
MOVE 6 0 ACCEPT
MOVE 5 2 DENY
MOVE 4 0 DENY
MOVE 4 0 DENY
MOVE 4 2 ACCEPT
MOVE 5 0 ACCEPT
MOVE 0 2 ACCEPT
MOVE 1 1 ACCEPT
MOVE 3 1 DENY
MOVE 3 1 DENY
MOVE 3 1 DENY
MOVE 3 0 ACCEPT
MOVE 2 2 ACCEPT
YOUR MOVE?
Response text={Based on the previous episodes, I hypothesize the following:

*   **Bucket 0:** Accepts Squares
*   **Bucket 1:** Accepts Circles
*   **Bucket 2:** Accepts Triangles
*   **Bucket 3:** Accepts Stars

Therefore, I will now try to move the remaining object, which is a Blue Square with ID 5, to Bucket 0.

MOVE 5 0
}
    
So the bot for some reason repeats its earlier successful move, MOVE 5 0, even though that move is staring right at it in the episode's transcript. I will try to mitigate this problem in ver 8.006 by explicitly showing the bot the current state of the board after each move.

Repeating a wrong move

With the model gemini-2.0-flash, this happens both on code 7 (IMMOVABLE) and code 4 (DENY). With Gemini 2.5 Flash, this is much less common; unless otherwise indicated, all examples below are from 2.0.

%    grep At gemini-ordL1-01.txt
    ...
At 2025-04-01 11:23:46, Moving piece 0 to bucket 0. Code=7
At 2025-04-01 11:23:51, Moving piece 0 to bucket 0. Code=7
At 2025-04-01 11:23:57, Moving piece 0 to bucket 0. Code=7
At 2025-04-01 11:24:02, Moving piece 0 to bucket 0. Code=7
At 2025-04-01 11:24:08, Moving piece 0 to bucket 0. Code=7
    ...
  
(ccw)
At 2025-04-01 11:48:43, Moving piece 6 to bucket 0. Code=4
At 2025-04-01 11:48:49, Moving piece 6 to bucket 0. Code=4

Sometimes this repetition happens just 2-3 times; but sometimes you can see the bot repeat the same failed move 30 times before, fortunately, trying something else.

In another twist on the same theme, sometimes Gemini alternates between several failed attempts, cycling between them endlessly:

    ...
At 2025-06-12 01:14:07, Moving piece 3 to bucket 3. Code=4
At 2025-06-12 01:14:14, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:14:21, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:14:28, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:14:35, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:14:42, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:14:49, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:14:55, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:15:02, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:15:09, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:15:16, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:15:24, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:15:31, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:15:38, Moving piece 1 to bucket 1. Code=4
At 2025-06-12 01:15:45, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:15:52, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:15:59, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:16:05, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:16:12, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:16:19, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:16:26, Moving piece 1 to bucket 1. Code=4
At 2025-06-12 01:16:33, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:16:40, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:16:47, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:16:55, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:17:02, Moving piece 1 to bucket 3. Code=4
At 2025-06-12 01:17:09, Moving piece 1 to bucket 1. Code=4
At 2025-06-12 01:17:16, Moving piece 1 to bucket 1. Code=4
At 2025-06-12 01:17:23, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:17:30, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:17:37, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:17:43, Moving piece 1 to bucket 0. Code=4
At 2025-06-12 01:17:50, Moving piece 1 to bucket 1. Code=4
At 2025-06-12 01:17:57, Moving piece 1 to bucket 0. Code=4
...

This happens despite my explicitly telling the bot (in the "system instructions") not to repeat a failed move until the board changes.

With Gemini 2.5 Flash, this has so far only been seen in a particularly difficult problem (which Gemini 2.5 wasn't able to solve in 300+ moves).

%    grep Movin gemini-colOrdL1_BRKY-250614-1520.txt
At 2025-06-14 23:41:48, Moving piece 8 to bucket 1. Code=0
At 2025-06-14 23:43:35, Moving piece 0 to bucket 2. Code=7
At 2025-06-14 23:45:55, Moving piece 1 to bucket 3. Code=7
At 2025-06-14 23:48:10, Moving piece 3 to bucket 1. Code=7
At 2025-06-14 23:49:41, Moving piece 0 to bucket 2. Code=7
At 2025-06-14 23:50:54, Moving piece 0 to bucket 2. Code=7
At 2025-06-14 23:52:29, Moving piece 3 to bucket 1. Code=7
At 2025-06-14 23:55:23, Moving piece 0 to bucket 2. Code=7

At 2025-06-14 23:56:49, Moving piece 4 to bucket 1. Code=0
   
In the above example, the sequence of 7 invalid moves (code 7 means IMMOVABLE) includes only 3 unique moves.

Will try reminding about it ...
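One way for the harness itself to detect such repeats (so that a reminder can be added to the next prompt) can be sketched as follows; this is hypothetical, not necessarily what the real app does.

```python
# Track moves that have failed since the board last changed (i.e. since the
# last ACCEPT); if the bot proposes one of them again, the harness can add
# a reminder to the next prompt instead of silently replaying the move.

class RepeatGuard:
    def __init__(self):
        self.failed = set()
    def is_repeat(self, move):
        """move: an (objectId, bucketId) pair."""
        return move in self.failed
    def record(self, move, code):
        if code == 0:                # ACCEPT: the board changed
            self.failed.clear()      # earlier failures may now succeed
        else:                        # DENY (4) or IMMOVABLE (7)
            self.failed.add(move)
```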

Timing and quotas

Since at the moment (April-June 2025) we are using Gemini as free-tier subscribers, we have to abide by the server's fairly strict usage quotas.

Gemini 2.0 Flash

Gemini 2.0 was pretty fast, so the main limiting factor was the free-tier quota, 15 requests per minute. By sleeping 5 seconds between requests, we by and large avoid hitting that limit. The daily quota for free-tier users is 1500 requests per day, which can easily be reached if you let the bot stupidly repeat the same bad move forever. This is why I usually limited Gemini 2.0 runs to, say, 600 requests.

As with almost everything about the Gemini server, the limits, and the server's reaction to breaching them, are fluid, meaning that you cannot expect the same reaction every time. When the Gemini server decides that the limit has been reached, it may respond with a detailed message indicating how many seconds we should wait before retrying in order to return to its good graces. Or it may send a more general message, in which case we try to sleep for 1 minute.
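The "detailed message" case can be handled by parsing the error body. The sketch below assumes the standard Google API error format, with a RetryInfo detail carrying a retryDelay value such as "41s"; the helper name is hypothetical.

```python
import json

# Decide how long to sleep after a quota error. If the 429 error body
# carries a "retryDelay" (e.g. "41s"), honor it; otherwise fall back to
# the general 1-minute sleep.

def seconds_to_wait(error_body, default=60):
    try:
        for d in json.loads(error_body)["error"]["details"]:
            if "retryDelay" in d:
                return int(d["retryDelay"].rstrip("s"))
    except (KeyError, TypeError, ValueError):
        pass
    return default
```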

Gemini 2.5 Flash

In contrast, requests with Gemini 2.5 take much longer (typically in the 30-120 sec range, with the time per request usually increasing slightly as the game progresses), at least for a free-tier user. Thus the per-minute quota limits (10 requests per minute, and 250,000 tokens per minute) are pretty much irrelevant. The daily limit, 500 requests per day, can still be reached after a few hours of work.

Gemini results (Play mode) on the rule sets from the FDCL/basic suite

The play-mode results for the newer model, gemini-2.5-flash-preview-05-20, are usually much better than for gemini-2.0-flash, which rarely accomplishes anything of note. For each rule set below, unless the two models are explicitly compared, the results are for the newer, better model.

In general, it seems that Gemini 2.5 Flash achieves Paul's suggested "mastery criterion" (10 successful move attempts in a row) on most of the rule sets from the FDCL/basic suite that we have tried. It does it fairly quickly, typically in under 100 moves (half a dozen episodes or so), and sometimes much faster than that. (Of course, since a Gemini 2.5 response takes up to 1-2 minutes to obtain, at least for free tier subscribers, we are still talking about the clock time on the order of an hour or several hours per problem).
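The mastery check itself is trivial; for concreteness (a sketch, with result code 0 meaning a successful move, as in the transcripts):

```python
# "Mastery criterion": the last `streak` move attempts were all successful.

def mastered(codes, streak=10):
    """codes: result codes of all move attempts so far, in order."""
    return len(codes) >= streak and all(c == 0 for c in codes[-streak:])
```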

When we end the run after the "mastery criterion" has been achieved, we explicitly ask the Gemini bot what it thinks the hidden rule set is, and it gives us a detailed explanation. Gemini 2.5 Flash being quite "talkative", it also fairly often volunteers an explanation of its current idea of the hidden rule, its confidence in it, and the hypothesis (or hypotheses) it plans to check.

The quality of the picture of the rule set that the Gemini bot has in its mind at the point of reaching 10 good moves is highly variable.

A few cases where I couldn't get Gemini to reach the 10-good-moves mastery criterion after 200+ moves included shaOrdL1_csqt and colOrdL1_BRKY: the two rule sets with the postpos: predicate. Rules of this kind weren't available in the original Rule Game; support for them (the postpos: predicate) was introduced in early 2025, when Paul proposed to include such rules in the FDCL/basic suite. These are fairly difficult for humans too, since they combine two concepts (color or shape, and position) in a somewhat complicated way: the player should not just go for a red piece (or a square) in the topmost row, but should first think of all red pieces (or all squares), then isolate the top row of that set, and then pick the leftmost red piece (or the leftmost square).

quadNearby

The older model, gemini-2.0-flash bogs down here (not much progress after 600 moves; lots of repetitions of failed moves, and occasionally even grabbing at the already-vacated cell). OTOH, the new model, gemini-2.5-flash-preview-05-20 is somehow able to solve this without making a single error.

Based on the successful moves in Episode 1 and the first move in Episode 2, the secret rule appears to be determined solely by the object's position (X, Y) on the board, dividing the board into four 3x3 quadrants, each mapping to a specific bucket. The color and shape of the objects do not seem to play a role. Here's the rule: The 6x6 board is divided into four 3x3 quadrants based on the object's (X, Y) coordinates:
1.  **Top-Left Quadrant (X between 1-3, Y between 4-6):** Objects in this quadrant must be moved to **Bucket 0** (X=0, Y=7).
    *   *Example from Episode 1:* Object 6 (x=3, y=5) was moved to Bucket 0.
    *   *Objects currently on board in Episode 2 fitting this:*
        *   Object 3 (BLACK, TRIANGLE, x=3, y=4) -> Bucket 0
        *   Object 5 (BLACK, SQUARE, x=1, y=5) -> Bucket 0
        *   Object 6 (BLACK, SQUARE, x=2, y=5) -> Bucket 0

2.  **Top-Right Quadrant (X between 4-6, Y between 4-6):** Objects in this quadrant must be moved to **Bucket 1** (X=7, Y=7).
    *   *Examples from Episode 1:* Objects 4 (x=4, y=4), 5 (x=5, y=4), 7 (x=4, y=5), 8 (x=5, y=6) were moved to Bucket 1.
    *   *Objects currently on board in Episode 2 fitting this:*
        *   Object 4 (BLACK, CIRCLE, x=5, y=4) -> Bucket 1
        *   Object 7 (BLACK, SQUARE, x=4, y=5) -> Bucket 1
        *   Object 8 (BLUE, CIRCLE, x=5, y=5) -> Bucket 1

3.  **Bottom-Right Quadrant (X between 4-6, Y between 1-3):** Objects in this quadrant must be moved to **Bucket 2** (X=7, Y=0).
    *   *Examples from Episode 1:* Objects 0 (x=4, y=2), 2 (x=4, y=3), 3 (x=5, y=3) were moved to Bucket 2.
    *   *Objects currently on board in Episode 2 fitting this:*
        *   Object 2 (BLUE, CIRCLE, x=5, y=3) -> Bucket 2

4.  **Bottom-Left Quadrant (X between 1-3, Y between 1-3):** Objects in this quadrant must be moved to **Bucket 3** (X=0, Y=0).
    *   *Example from Episode 1:* Object 1 (x=2, y=3) was moved to Bucket 3.
    *   *Example from Episode 2 (your successful move):* Object 0 (x=1, y=1) was moved to Bucket 3.
    *   *Objects currently on board in Episode 2 fitting this:*
        *   Object 1 (YELLOW, TRIANGLE, x=2, y=3) -> Bucket 3
In summary, the secret rule is to determine which of the four 3x3 coordinate quadrants an object occupies and then move it to the specific bucket associated with that quadrant.
Or, in another run:
The secret rule appears to be based on the proximity of the object to the bucket corners, using the Manhattan distance. Here is my understanding of the rule: 1. **Calculate Manhattan Distance:** For any object at `(X, Y)` on the board, calculate its Manhattan distance to each of the four bucket corner coordinates:
    *   **Bucket 0 (X=0, Y=7):** Distance = `|X - 0| + |Y - 7|`
    *   **Bucket 1 (X=7, Y=7):** Distance = `|X - 7| + |Y - 7|`
    *   **Bucket 2 (X=7, Y=0):** Distance = `|X - 7| + |Y - 0|`
    *   **Bucket 3 (X=0, Y=0):** Distance = `|X - 0| + |Y - 0|`
2. **Determine the Target Bucket:** An object can only be moved to the bucket (or buckets, in case of a tie) that has the *minimum* Manhattan distance from the object's current position. All moves made so far in Episode 1 and the first move in Episode 2 are consistent with this rule. There is no indication that the rule depends on object properties (color, shape) or the history of previous successful moves.
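The Manhattan-distance rule the bot states is easy to put into code for checking its explanation against a transcript (bucket corner coordinates as quoted above; this is a checking sketch, not harness code):

```python
# Buckets at the minimum Manhattan distance from a piece at (x, y).
# Corner coordinates as in the bot's explanation above.

BUCKETS = {0: (0, 7), 1: (7, 7), 2: (7, 0), 3: (0, 0)}

def allowed_buckets(x, y):
    """All buckets (ties included) at minimal Manhattan distance from (x, y)."""
    dist = {b: abs(x - bx) + abs(y - by) for b, (bx, by) in BUCKETS.items()}
    dmin = min(dist.values())
    return sorted(b for b, d in dist.items() if d == dmin)
```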

The termination ("mastery") criterion in all runs was 10 correct moves in a row.

quadMixed1

While the old model bogs down here, the newer model reached the mastery criterion in 36 moves. The reasoning was explained as follows:

 The secret rule determines which bucket an object can be placed into based solely on its position (X,Y coordinates) on the board. The board is conceptually divided into four quadrants using the midpoints of the X and Y ranges (X=3.5 and Y=3.5).

Here's the detailed rule:

1.  **Board Quadrants Definition:**
    *   **Top-Left Quadrant (TL):** Cells where `X <= 3` AND `Y >= 4`
    *   **Top-Right Quadrant (TR):** Cells where `X >= 4` AND `Y >= 4`
    *   **Bottom-Right Quadrant (BR):** Cells where `X >= 4` AND `Y <= 3`
    *   **Bottom-Left Quadrant (BL):** Cells where `X <= 3` AND `Y <= 3`

2.  **Bucket Mapping Rule:**
    *   If an object is in the **Top-Left Quadrant (TL)**, it must be moved to **Bucket 1**. (Bucket 1 is located at X=7, Y=7, corresponding to the Top-Right corner).
    *   If an object is in the **Top-Right Quadrant (TR)**, it must be moved to **Bucket 3**. (Bucket 3 is located at X=0, Y=0, corresponding to the Bottom-Left corner).
    *   If an object is in the **Bottom-Right Quadrant (BR)**, it must be moved to **Bucket 2**. (Bucket 2 is located at X=7, Y=0, corresponding to the Bottom-Right corner).
    *   If an object is in the **Bottom-Left Quadrant (BL)**, it must be moved to **Bucket 0**. (Bucket 0 is located at X=0, Y=7, corresponding to the Top-Left corner).

In summary:
*   Objects in the **Bottom-Right** quadrant go to the **Bottom-Right** bucket (Bucket 2).
*   All other objects are moved to the *diagonally opposite* bucket relative to their quadrant, if we consider bucket positions as corresponding to board corners:
    *   Top-Left quadrant -> Top-Right bucket (Bucket 1)
    *   Top-Right quadrant -> Bottom-Left bucket (Bucket 3)
    *   Bottom-Left quadrant -> Top-Left bucket (Bucket 0)

This rule has successfully explained all ACCEPT and DENY outcomes in Episode 1, Episode 2, and the current moves in Episode 3. There have been no IMMOVABLE responses, indicating that all objects are always movable to their designated bucket.

sm_qcts (shape match)

This is one of the very few rules where even with Gemini 2.0, mastery was demonstrated after ca. 200 move attempts. The explanation has a model more or less consistent with reality, e.g.
Based on these observations, here's my hypothesis:
*   **Bucket 0:** Accepts `SQUARE` or `BLACK`
*   **Bucket 1:** Accepts `CIRCLE` or `RED`
*   **Bucket 2:** Accepts `TRIANGLE` or `BLUE`
*   **Bucket 3:** Accepts `STAR` or `YELLOW`
With Gemini 2.5 Flash, the bot solved the game in 23 move attempts. The rule description given by the bot was super-overtrained, but apparently mostly correct.

    

ordL1 (reading order)

Gemini 2.5 solved this in 26 moves.

Here's my understanding of the secret rule based on the observed episodes:

The game has two main rules that dictate moves:
1.  **Which object can be moved (Movability Rule):** Only one specific object is "movable" at any given time. If you attempt to move any other object, it will be "IMMOVABLE".
    *   The movable object is determined by its position (X, Y) on the board.
    *   It must have the **highest Y-coordinate** (i.e., be in the highest row) among all objects currently on the board.
    *   If there are multiple objects in that highest row, it must be the one with the **lowest X-coordinate** (i.e., the leftmost object in that row).
    *   This implies a "top-down, then left-to-right" priority for object removal. Objects in lower Y-coordinates (or to the right in the same Y-coordinate) become movable only after all objects above/to their left are removed.

2.  **To which bucket it must be moved (Bucket Assignment Rule):** Once an object is identified as movable, the correct bucket it must be placed into depends solely on its position (X, Y). If you choose the wrong bucket, the response will be "DENY".

    *   **If the object's Y-coordinate is 4 or greater (Y >= 4):**
        *   It must be moved to **Bucket 0** (the bucket located at X=0, Y=7). This applies to all objects in rows 4, 5, and 6.

    *   **If the object's Y-coordinate is 3 or less (Y <= 3):**
        *   **If the object's X-coordinate is 3 or less (X <= 3):**
            *   It must be moved to **Bucket 3** (the bucket located at X=0, Y=0). This applies to objects in rows 1, 2, or 3, and columns 1, 2, or 3.
        *   **If the object's X-coordinate is greater than 3 (X > 3):**
            *   It must be moved to **Bucket 2** (the bucket located at X=7, Y=0). This applies to objects in rows 1, 2, or 3, and columns 4, 5, or 6.

The object's color, shape, or ID, as well as the previous successful move, do not seem to influence which object can be moved or to which bucket it goes.

**Summary in simplified terms:**
*   Always clear the board from top to bottom, and within each row, from left to right.
*   Objects in the top half (rows 4, 5, 6) go to the top-left bucket (Bucket 0).
*   Objects in the bottom half (rows 1, 2, 3) go to either the bottom-left bucket (Bucket 3) if they are in the left half (columns 1, 2, 3), or the bottom-right bucket (Bucket 2) if they are in the right half (columns 4, 5, 6).

I will now proceed with the next move based on this rule.
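The pick rule the bot describes (highest row first, leftmost within that row) is just reading order, and can be checked with a one-liner (coordinates as in the transcript, with larger Y meaning a higher row):

```python
# The only movable piece under the bot's stated rule: highest Y first,
# then lowest X among the pieces in that row.

def next_movable(objects):
    """objects: non-empty dict id -> (x, y)."""
    return max(objects, key=lambda i: (objects[i][1], -objects[i][0]))
```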

allOfColOrd_BRKY

( Colors in order. All blue before any red, etc.)

Solved in 51 moves (to 10 good moves), or 89 moves (to 25). After 10 good moves, the complicated explanation contained some unnecessary extraneous rules, alongside the correct (and sufficient) core rule:

... Objects must be cleared from the board in a strict sequence of colors: **BLUE -> RED -> BLACK -> YELLOW** ...

Weirdly, after 25 good moves the explanation given is entirely spurious, more along the lines of "order L1", although the bot admits that it's not entirely correct.

allOfColOrd_KRBY

(Similar.) Demonstrated "mastery" (10 good moves in a row) after 47 moves. However, the bot still had not worked out a good idea of the rules by that point:
Okay, I've analyzed the past episodes and the current state to try and figure out the secret rule. This has been quite a challenge, as some patterns appear and then contradict themselves across episodes. However, I have a working hypothesis that covers most of the observed behavior. ...
This is followed by a very complicated theory (overfitting indeed!), which starts with a position-based approach (four quadrants) combined with other things, among which are colors too.

The bot is fully cognizant of the limitations of the current approach, so if it were told to play longer -- say up to 20 good moves in a row, or up to some kind of R-value (Bayesian) threshold -- it probably would discover the real rule.

allOfShaOrd_csqt

(shape:circle)
(shape:star)
(shape:square)
(shape:triangle)

The 10-good-move criterion is achieved after 115 move attempts, but the explanation is largely incorrect. The bot apparently thinks that the movability criteria are different in different episodes.

allOfShaOrd_qcts

Achieved the mastery criterion (10 good moves) after 40 moves, but the explanation is entirely wrong (some weird position-based algo). Should try more moves...

buckets_2130

Found the correct rule in 26 moves:
    ...
       More formally:
    *   If `N % 4 == 0` (i.e., the 1st, 5th, 9th, etc., successful move), the object must go to **Bucket 2**.
    *   If `N % 4 == 1` (i.e., the 2nd, 6th, 10th, etc., successful move), the object must go to **Bucket 1**.
    *   If `N % 4 == 2` (i.e., the 3rd, 7th, 11th, etc., successful move), the object must go to **Bucket 3**.
    *   If `N % 4 == 3` (i.e., the 4th, 8th, 12th, etc., successful move), the object must go to **Bucket 0**.

cm_KRBY

Found the correct rule (color matching) after 31 moves.

cm_RBKY_cw_0123

(Need to pick colors in order, and use the designated bucket for each one) Found essentially the correct rule (with some extraneous additions) in 80 moves.
 ...
**Rule 1: Bucket Destination (Color-based)**
This rule determines which bucket an object can be placed into. It is solely based on the object's **color**:
*   **RED** objects must be placed into **Bucket 0** (X=0, Y=7, Top-Left).
*   **BLUE** objects must be placed into **Bucket 1** (X=7, Y=7, Top-Right).
*   **BLACK** objects must be placed into **Bucket 2** (X=7, Y=0, Bottom-Right).
*   **YELLOW** objects must be placed into **Bucket 3** (X=0, Y=0, Bottom-Left).

  *Explanation:*

  * If you attempt to move an object to a bucket that does not match
  its color, you will receive a "DENY" response.

  * All successful "ACCEPT" moves observed across all episodes
  strictly adhere to this color-to-bucket mapping.

**Rule 2: Object Availability (Movability Sequence)**

  This rule determines *which* object is currently movable on the
  board. All other objects will yield an "IMMOVABLE" response if you
  try to move them.

  1.  **First Move of an Episode**: The very first object that can be
  moved in any episode is always the **RED object with the lowest ID**
  among all RED objects currently on the board.
  
  2.  **Subsequent Moves**: After an object is successfully moved to a
  specific bucket (let's call it `Bucket N`), the next object(s) that
  become movable are determined by a cyclical sequence of buckets:
  
    *   If the previous object was placed in **Bucket 0** (RED), then the next movable object(s) will be **BLUE** objects.
    *   If the previous object was placed in **Bucket 1** (BLUE), then the next movable object(s) will be **BLACK** objects.
    *   If the previous object was placed in **Bucket 2** (BLACK), then the next movable object(s) will be **YELLOW** objects.
    *   If the previous object was placed in **Bucket 3** (YELLOW), then the next movable object(s) will be **RED** objects.

     * **Skipping Colors**: If there are no objects of the target color
  (e.g., no BLUE objects left when Bucket 0 was the last target), the
  rule cycles to the next color in the sequence until an available
  object is found (e.g., after Bucket 0, if no BLUE objects, try
  BLACK, then YELLOW, then RED again).

  3.  **Specific Object Selection (within a color group)**: If there
  are multiple objects of the designated "next movable" color on the
  board, only *one specific object* among them is made movable by the
  game at that moment. The exact criteria for selecting this specific
  object (e.g., based on X/Y coordinates, shape, or a specific ID
  ordering other than lowest ID) are not fully apparent from the
  provided logs, as there were inconsistent patterns (sometimes lowest
  ID, sometimes not). However, the "IMMOVABLE" response indicates that
  an object is not the designated one.

col1Ord_BRKY

No solution after 330 moves. Tantalizingly, the bot did at least once come up with a good explanation... but did not re-generate it on subsequent moves.

col1OrdBuck_BRKY0213

1 (color:blue, bucket:0)
1 (color:red, bucket:2)
1 (color:black, bucket:1)
1 (color:yellow, bucket:3)
The bot reached the mastery criterion (10 good moves) after 104 moves, but admits that it only knows the rules partially: it knows which color goes to which bucket, but is not sure in which order the pieces need to be picked; it wants to continue experimenting.

colOrdL1_BRKY

1 (color:blue, postpos:L1)
1 (color:red, postpos:L1)
1 (color:black, postpos:L1)
1 (color:yellow, postpos:L1)
Failed to solve. I interrupted the job after 291 moves, by which point the bot seemed to have long been stuck on a wrong concept of the rules, mostly based on position (quadrants). Also, this was a fairly rare case of Gemini 2.5 repeating failed moves several times.

cw_0123

Solved in 15 moves. Described the rules well.

ordL1_Nearby

Fully solved in 15 moves. (25 good moves follow as well).

ordRevOfL1_Nearby

Fully solved in 95 move attempts. Good explanation given ("snake scan" :-). (25 good moves soon follow as well).

ordRevOfL1_Remotest

Order RevOfL1=[6,5,4,3,2,1, 12,11,10,9,8,7, 18,17,16,15,14,13, 24,23,22,21,20,19, 30,29,28,27,26,25, 36,35,34,33,32,31]
(pos:RevOfL1, bucket:Remotest)
Achieved the 10-good-moves mastery criterion after 158 moves, but the discovered rule is not correct. The bot figured out early on that a piece can only be put into the most remote bucket, and which piece must be picked first. The rules it proposes for subsequent picks, though, make little sense.

ordRevOfL1

Solved in 30 moves. Discovered the correct pick rule (the reverse of L1), but with a spurious restriction on destination buckets (it thinks one must use the nearby bucket). The spurious restriction seems to be based on positive evidence (yes, of course you can use that bucket), without any negative evidence:
* **Bucket Selection (DENY/ACCEPT):**
* Across both Episode 1 and Episode 2, every successful move ("ACCEPT") consistently placed the object into the bucket corresponding to its quadrant as defined above. There were no "DENY" responses in the provided logs, but the consistent "ACCEPT" pattern strongly suggests this quadrant-based mapping for bucket selection.

sha1OrdBuck_qcts0213

1 (shape:square, bucket:0)
1 (shape:circle, bucket:2)
1 (shape:triangle, bucket:1)
1 (shape:star, bucket:3)
Correctly solved in 172 moves. Good description of the rules.

sha1Ord_csqt

1 (shape:circle)
1 (shape:star)
1 (shape:square)
1 (shape:triangle)
The mastery criterion is met after 87 moves, but the hypothesized rules are entirely wrong. (The bot knows that it needs to start an episode with a circle, but then has a complicated position-based rule, which has nothing to do with the true secret rule.)

shaOrdL1_csqt

1 (shape:circle, postpos:L1)
1 (shape:star, postpos:L1)
1 (shape:square, postpos:L1)
1 (shape:triangle, postpos:L1)
Lots of repeats of a bad move; likely won't solve it. I interrupted it after 200 move attempts.

  action w2020/gemini-2.5-flash-logs> more /opt/w2020/game-data/rules/pk/shapeVcolor.txt
  # to see whether naive players are more likely to find shape or color match
  (shape:square,bucket:0) (shape:circle,bucket:1) (shape:triangle,bucket:2) (shape:star,bucket:3)
  (color:red,bucket:0) (color:yellow,bucket:1) (color:green,bucket:2) (color:blue,bucket:3)

Cost estimates

If we switch to a paid tier, how much would playing a game cost?

For the Gemini 2.5 Flash model (a thinking model), Google quotes a price of $0.15 per million input tokens and $3.50 per million output tokens.

A token, more or less, is a word or a punctuation mark. For a medium-difficulty rule set, where the mastery criterion was achieved at 80 moves, and an essentially correct rule was discovered, the stats for the log file look like this:

  
~/w2020/gemini_logs-3/victory> grep -c Moving  gemini-cm_RBKY_cw_0123-250613-0953.txt | more
80
~/w2020/gemini_logs-3/victory> wc  gemini-cm_RBKY_cw_0123-250613-0953.txt 
  8397  62689 554826 gemini-cm_RBKY_cw_0123-250613-0953.txt

This log consists mostly of the texts of requests and responses. We can separate responses as follows:

  perl -pe 's/\n/ /sg' gemini-cm_RBKY_cw_0123-250613-0953.txt > a.tmp
  perl -pe 's/.*?Response text=\{(.+?)\}/$1\n/g' a.tmp > b.tmp

~/w2020/gemini_logs-3/victory>  wc b.tmp 
    81  28682 175880 b.tmp
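The same separation can be done in Python. A rough sketch, assuming (as the perl one-liners above do) that each response appears in the flattened log as `Response text={...}`; the function name and the toy log fragment are ours:

```python
import re

def split_log(text):
    """Split a play-mode log into response texts and the request-side remainder.

    Assumes responses are logged as 'Response text={...}', matched
    non-greedily, as in the perl pipeline above.
    """
    flat = text.replace("\n", " ")  # flatten the log, like the first perl pass
    responses = re.findall(r"Response text=\{(.+?)\}", flat)
    remainder = re.sub(r"Response text=\{.+?\}", " ", flat)
    return responses, remainder

# Toy usage on a fabricated two-move fragment:
log = "request 1 ... Response text={I move piece 3}\nrequest 2 ... Response text={I move piece 5}\n"
resp, rest = split_log(log)
```

Word counts on the two parts (as `wc` produces above) then give the input/output split needed for the cost estimate.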

So the rough estimate is that in this session the responses comprised 28,682 words, and the requests the rest, i.e. ca. 34,000 words. Assuming 1.5 tokens per word, that gives us ca. 42,000 tokens in the responses and 51,000 in the requests, for a total cost estimate of

51*0.001 * 0.15 + 42*0.001 * 3.50 = $0.15
for playing that game.
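The arithmetic can be checked directly; a minimal sketch using the token counts and the Flash prices quoted above:

```python
# Estimated token counts for the 80-move game (1.5 tokens per word)
input_tokens = 51_000    # requests
output_tokens = 42_000   # responses

# Gemini 2.5 Flash prices quoted above, in dollars per million tokens
PRICE_IN, PRICE_OUT = 0.15, 3.50

cost = input_tokens / 1e6 * PRICE_IN + output_tokens / 1e6 * PRICE_OUT
print(f"${cost:.2f}")   # about $0.15
```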

As the large number of tokens indicates, thinking models such as Gemini 2.5 Flash are quite talkative: in response to many requests they explain their reasoning even without being asked. The cost could probably be reduced quite a bit by explicitly telling the bot NOT to provide any reasoning unless asked.

Google tells us that

Implicit caching is enabled on Gemini 2.5 models by default. If a request contains content that is a cache hit, we automatically pass the cost savings back to you.

Google also advertises its "Gemini 2.5 Pro Preview" model, which supposedly "excels at coding and complex reasoning tasks". It is priced significantly higher; with that model, playing the same game as above would cost approximately

51*0.001 * 2.50 + 42*0.001 * 15 = $0.75.

The cost of a game depends, of course, on how many moves it involves. It is fair to assume that the number of input tokens grows quadratically with the number of moves (each request carries the transcript of the game so far, so request sizes grow linearly and their sum quadratically), while the number of output tokens grows linearly. So if we had a similar game with 800 moves instead of 80, the cost estimate would be

51*100*0.001 * 0.15 + 42*10*0.001 * 3.50 = $2.20
for Gemini 2.5 Flash, and
51*100*0.001 * 2.50 + 42*10*0.001 * 15 = $19.
for Gemini 2.5 Pro.

And the thinking costs money too...

Addendum 2025-06-19: I just realized that with the thinking models the pricing is based not only on the visible output tokens but also on the invisible "thought tokens"; those are priced the same as output tokens.

Those thinking tokens add up quickly: on some runs with temperature=0, I had the "totalTokenCount" (the sum of the output token count and the thought token count) exceed 67K even before the targetStreak=10 was reached. So if we need 100 moves to solve a problem, at 50K total tokens per move, that's 5 million total tokens; priced at $2.50/million (in G2.5 Flash), that would cost $12.50 per problem; with G2.5 Pro (at $10/million), that would be $50 per problem.
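Under these assumptions (100 moves, ca. 50K output-plus-thought tokens per move, and the output-token prices of $2.50/M for Flash and $10/M for Pro quoted in the addendum), the per-problem cost works out as:

```python
moves = 100
tokens_per_move = 50_000           # output + thought tokens, a rough per-move figure
total = moves * tokens_per_move    # 5 million tokens per problem

cost_flash = total / 1e6 * 2.50    # $12.50 per problem with G2.5 Flash
cost_pro   = total / 1e6 * 10.0    # $50 per problem with G2.5 Pro
```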

Gemini queries with "prepared episodes"

I carried out a series of runs in which the Gemini bot was shown 10 prepared episodes for each rule and asked to deduce the rule. In another series, the bot was shown 100 prepared episodes per rule.

The runs were conducted with the default temperature. In each run, I asked Gemini to provide 3 candidate responses. (Gemini being non-deterministic, it can always come up with multiple answers if asked.) However, while the 3 responses could be very different textually, I usually did not see much difference in their degree of correctness. In those cases where the responses were substantially different, I listed the individual grades, separated by commas.

The results are in /home/vmenkov/gemini/prepared-10 and /home/vmenkov/gemini/prepared-100

The grading key:

Rule set             | Bot plays | prep 10 | prep 100 | Notes
allOfColOrd_BRKY     |           | C       | C        | position-based theory
allOfColOrd_KRBY     |           | B+      | C,C+,C   | guessed KBRY instead of KRBY
allOfShaOrd_csqt     |           | C+      | C        | position-based theory, but knows that it can start with a circle
allOfShaOrd_qcts     |           | C+      | C+       | knows to start with a square
buckets_2130         |           | A       | C,A,B-   |
ccw                  |           | A       | A        |
cm_KRBY              |           | A       | A        |
cm_RBKY              |           | A       | C,C,B    |
cm_RBKY_cw_0123      |           | A       | B+,A,A   |
col1Ord_BRKY         |           | C       | C        | Talks about target buckets
col1OrdBuck_BRKY0213 |           | B       | B-,C,B   |
col1OrdBuck_BRKY3120 |           | B+      |          |
col1Ord_KRBY         |           | C       |          |
colOrdL1_BRKY        |           | C       |          |
colOrdL1_KBYR        |           | C       |          |
cw_0123              |           | A       |          |
cw                   |           | A       |          |
ordL1                |           | A-      |          | Describes the correct L1 rule, but in terms of "likelihood" rather than certainty
ordL1_Nearby         |           | C+,B-   |          |
ordRevOfL1           |           | C       |          |
ordRevOfL1_Nearby    |           | C+      |          | Figures out the bucket assignment, but not the RevOfL1 order
ordRevOfL1_Remotest  |           | C+      |          | Deduces statistical bucket-based rules rather than position-based ones, but they are better than chance, because bucket and position correlate
quadMixed1           |           | A       |          |
quadNearby           |           | A       |          |
sha1OrdBuck_qcts0213 |           | A       |          |
sha1OrdBuck_tqsc0213 |           | B+      |          | Claimed slightly more permissive rules than the real ones, and dismissed contradicting facts as "anomalies"
sha1Ord_csqt         |           | C       |          |
sha1Ord_qcts         |           | C       |          |
shaOrdL1_csqt        |           | C       |          |
shaOrdL1_qcts        |           | C,B,C   |          | One candidate figured out the order of shapes, but not the position restriction
sm_csqt              |           | A       |          |
sm_qcts              |           | A       |          |

Training on a smaller number of episodes. This experiment was organized as follows: in each run, the bot was shown a number of completed prepared episodes (1, 2, 3, 5, or 10) played with some hidden rule, was asked to guess that rule, and, for 5 additional initial boards with N objects each (N=9), to provide a plan to clear each board in N moves (i.e. without making any errors in that "future episode"). The bot was asked to provide 3 response candidates; thus, in each run we had up to 15 proposed solutions (up to 3 for each initial board) to look at and evaluate. (Sometimes, of course, the number was smaller than 15, because the bot occasionally fails to provide everything that is asked of it.) The table below contains, for each run, the number of fully correct proposed solutions among these 5*3 = 15 future episodes.

Two types of system instructions were provided: basic ones (series b) and expanded ones, which explicitly describe a number of sample rule sets (series a). The files are, respectively, /opt/w2020/game-data/gemini/system-prepared-03.txt and system-prepared-02.txt:

action game-data/gemini> pwd
/opt/w2020/game-data/gemini
action game-data/gemini> dir system-prepared-0?.txt
-rw-rw-r-- 1 tomcat tomcat 6006 Jul 22 02:05 system-prepared-02.txt
-rw-rw-r-- 1 tomcat tomcat 4713 Jul 22 11:32 system-prepared-03.txt

For each rule set, the same RG seed (seed=1) is used in all runs; therefore, the set of 3 episodes shown to the bot e.g. in the prepared-a-3 run is the same as the one shown in the prepared-b-3 run, and a subset of the set shown in the prepared-a-5 and prepared-b-5 runs.

For comparison, the table also contains the m* ("moves to mastery") numbers from the play-mode experiments, for those rule sets where a play-mode experiment was conducted and a level of "mastery" (10 error-free moves) was displayed. The value "Infinity" is used when the play was interrupted, without displaying "mastery", after a large number of moves (typically, at least 300).
In the table, the m* column gives the number of moves until mastery in the play-mode runs; the remaining columns give, for the prepared-episodes runs, the number of good future episodes (out of 15). Each of those columns is labeled by its run directory: series a (prepared-5-a, prepared-10-a) used instructions with examples, series b (prepared-1-b through prepared-10-b) used instructions without examples, and the number in the directory name is the number of prepared episodes shown (5, 10, 1, 2, 3, 5, 10, respectively).

Rule set             | m*       | 5-a | 10-a | 1-b | 2-b | 3-b | 5-b | 10-b
allOfColOrd_BRKY     | 39       | 15  | 0    | 2   | 0   | 0   | 0   | 0
allOfColOrd_KRBY     | 37       | 5   | 13   | 3   | 1   | 1   | 0   | 0
allOfShaOrd_csqt     | 105      | 0   | 0    | 0   | 0   | 0   | 0   | 0
allOfShaOrd_qcts     | 30       | 0   | 0    | 0   | 0   | 0   | 0   | 0
buckets_2130         | 17       | 15  | 10   | 15  | 10  | 10  | 10  | 15
ccw                  |          | 5   | 15   | 15  | 15  | 15  | 10  | 15
cm_KRBY              | 22       | 15  | 10   | 10  | 15  | 10  | 10  | 9
cm_RBKY              |          | 15  | 15   | 10  | 15  | 10  | 7   | 15
cm_RBKY_cw_0123      | 71       | 3   | 0    | 0   | 0   | 1   | 1   | 0
col1OrdBuck_BRKY0213 | 95       | 1   | 0    | 1   | 0   | 1   | 0   | 0
col1OrdBuck_BRKY3120 |          | 1   | 0    | 0   | 0   | 2   | 0   | 0
col1Ord_BRKY         | Infinity | 0   | 0    | 0   | 0   | 0   | 0   | 0
col1Ord_KRBY         |          | 0   | 0    | 0   | 0   | 0   | 0   | 0
colOrdL1_BRKY        | Infinity | 0   | 0    | 0   | 0   | 0   | 0   | 0
colOrdL1_KBYR        |          | 0   | 0    | 0   | 0   | 0   | 0   | 0
cw                   | 37       | 15  | 15   | 15  | 15  | 15  | 15  | 10
cw_0123              | 6        | 15  | 15   | 15  | 10  | 2   | 12  | 10
ordL1                | 16       | 0   | 1    | 0   | 0   | 0   | 0   | 0
ordL1_Nearby         | 6        | 0   | 0    | 0   | 0   | 5   | 0   | 5
ordRevOfL1           | 20       | 2   | 1    | 0   | 2   | 0   | 0   | 0
ordRevOfL1_Nearby    | 86       | 0   | 0    | 0   | 0   | 0   | 0   | 0
ordRevOfL1_Remotest  | 148      | 0   | 0    | 0   | 5   | 0   | 0   | 0
quadMixed1           | 27       | 10  | 0    | 10  | 15  | 15  | 10  | 10
quadNearby           | 1        | 10  | 15   | 15  | 15  | 11  | 9   | 14
sha1OrdBuck_qcts0213 | 163      | 0   | 0    | 0   | 0   | 0   | 0   | 0
sha1OrdBuck_tqsc0213 |          | 0   | 0    | 0   | 0   | 0   | 1   | 0
sha1Ord_csqt         | 77       | 0   | 0    | 0   | 0   | 0   | 0   | 0
sha1Ord_qcts         |          | 0   | 0    | 0   | 0   | 0   | 0   | 0
shaOrdL1_csqt        | Infinity | 0   | 0    | 0   | 0   | 0   | 0   | 0
shaOrdL1_qcts        |          | 0   | 0    | 0   | 0   | 0   | 0   | 0
sm_csqt              |          | 10  | 8    | 0   | 15  | 2   | 10  | 0
sm_qcts              | 14       | 10  | 10   | 5   | 15  | 15  | 5   | 4

For the data from this table broken down into groups see the group breakdown table

To ward off Paul's suggestion that the bot perhaps may somehow "transfer learning between runs", we also ran a series of experiments in which a different RG seed was used for each run. (Otherwise, the experiments were organized in the same way as those described in the previous table.) It showed no quantitative difference from the above.
(b) Instructions w/o examples. Each column is labeled by its run directory (prepared-1-b through prepared-5-b); the number of prepared episodes is 1, 2, 3, and 5, respectively.

Rule set             | 1-b | 2-b | 3-b | 5-b
allOfColOrd_BRKY     | 3   | 0   | 0   | 0
allOfColOrd_KRBY     | 0   | 0   | 0   | 0
allOfShaOrd_csqt     | 0   | 0   | 0   | 0
allOfShaOrd_qcts     | 0   | 0   | 3   | 4
buckets_2130         | 10  | 10  | 14  | 15
ccw                  | 15  | 15  | 15  | 15
cm_KRBY              | 10  | 15  | 14  | 10
cm_RBKY              | 15  | 9   | 13  | 15
cm_RBKY_cw_0123      | 0   | 0   | 4   | 0
col1OrdBuck_BRKY0213 | 0   | 0   | 0   | 0
col1OrdBuck_BRKY3120 | 1   | 0   | 0   | 2
col1Ord_BRKY         | 0   | 0   | 0   | 0
col1Ord_KRBY         | 0   | 0   | 0   | 0
colOrdL1_BRKY        | 0   | 0   | 0   | 0
colOrdL1_KBYR        | 0   | 0   | 0   | 0
cw                   | 15  | 15  | 15  | 15
cw_0123              | 15  | 15  | 15  | 15
ordL1                | 0   | 0   | 0   | 0
ordL1_Nearby         | 4   | 0   | 5   | 0
ordRevOfL1           | 0   | 0   | 0   | 0
ordRevOfL1_Nearby    | 0   | 8   | 0   | 0
ordRevOfL1_Remotest  | 0   | 0   | 0   | 0
quadMixed1           | 5   | 10  | 11  | 11
quadNearby           | 15  | 15  | 15  | 15
sha1OrdBuck_qcts0213 | 0   | 0   | 0   | 0
sha1OrdBuck_tqsc0213 | 1   | 1   | 0   | 0
sha1Ord_csqt         | 0   | 0   | 0   | 0
sha1Ord_qcts         | 0   | 0   | 0   | 0
shaOrdL1_csqt        | 0   | 0   | 0   | 0
shaOrdL1_qcts        | 0   | 0   | 0   | 0
sm_csqt              | 0   | 11  | 15  | 15
sm_qcts              | 10  | 10  | 12  | 10