The Export Transcripts tool

Updated 2024-12-02, for Game Server 6.046

Introduction

The Rule Game server and the suite of tools that come with it offer a variety of tools to access and process the accumulated game data. Depending on one's needs and inclinations, one can simply use the data as they are, by reading the transcript CSV files and sending queries to the SQL database; one can copy the entire data set (CSV files and SQL tables) to your own computer; one can divide transcript files into sections associated with different rules sets, and add to them informaiton about the statistical difficulty of correct moves.

This document describes one of the simplest tools, the script export-transcripts.sh. Its primary purpose is to select the transcript files from the participants who satisfy your criteria (e.g. "everybody listed in study summary files on Prolific"; "everybody who participated in experiment plans FDCL/basic, and whose player ID started with prolific- or RU-", etc). Additionally, it will also extract the selected players' rows from the tables in the MySQL database, and will export them in CSV format. All files so selected or produced will be put into a new directory, so that you can ZIP it and then transfer the ZIP file to another computer for further analysis.

Use case examples

The script export-transcripts.sh takes most of the same command line arguments as does analyze-transcripts.sh, allowing selecting records based on experiment plan name, player IDs, user ID (for repeat players), etc. This section discusses a few examples that are likely to occur in your practice.

Example 1: Using Prolific summary files

If you run an "experiment" on Prolific, Prolific stores the list of participants for each "study" included in the experiment. You can download that list by going to the study's detail page, opening the "Actions" pop-up menu, choosing "Download summary", and saving the CSV file offered to you.

The download summary file is in CSV format with many columns; it looks like this:

% more prolific_export_671b8ae84509b9a624b95e13.csv
  Submission id,Participant id,Status,Custom study tncs accepted at,...
671b8e95b17f88050648883e,5e9088a8a4b6ed018b450000,APPROVED,Not Applicable,....
671b8e98fd64b1b65fa1a726,5c655e3c42faef0001280000,APPROVED,Not Applicable,....
....

The name of the file contains the Prolific study ID, while the second column of the file contains the participants' Prolific PID. (Note that it only includes the real Prolific participants, to whom you pay via the Prolific platform; it does not have an entry for yourself doing "preview as participant" of the study).

When the Prolific participants play our games, their Prolific PIDs are converted to our Game Server player IDs, which combine the study ID and the Prolific PID, e.g. prolific-671b8ae84509b9a624b95e13-5e9088a8a4b6ed018b450000.

Once you have downloaded the study summary files for your experiment to your PC, you need to upload them to the Game Server host (action.rutgers.edu) using sftp or scp. Put them into the same directory under your home directory, e.g. fdcl-01.

Log in to the Game Server host using ssh, go to the directory where you have put the summary files, and run the export transcripts script:

      cd fdcl-01
      /home/vmenkov/w2020/game/scripts/export-transcripts.sh -pid -file prolific_export_671*.csv

If the script runs successfully, you will find that it has created a directory named tmp, with the following content:

    % ls -l tmp
total 8
drwxr-xr-x  75 vmenkov  staff  2400 Nov  6 19:54 detailed-transcripts/
-rw-r--r--   1 vmenkov  staff   928 Nov  9 15:22 pid.csv
drwxr-xr-x   4 vmenkov  staff   128 Nov  9 15:22 sql/
drwxr-xr-x  75 vmenkov  staff  2400 Nov  6 19:54 transcripts/

Here, the subdirectories transcripts and detailed-transcripts contain copies of the relevant CSV files from Game Server's saved-files directories of the same names, with any duplicate episode records removed. The file pid.csv contains the list of the Game Server player IDs, and the subdirectory sql contains relevant extracts from the SQL server's tables as well as the scoring date (see the section "Score data" below)

You can now make a ZIP file containing the entire content of the directory tmp:

      zip -r tmp.zip tmp

and then transfer tmp.zip to your personal computer, or any another host, for further analysis. After that, you can remove the directory tmp the Rutgers server, or rename it somehow, so that its content would not be confusing for any future run of the export transcript tool at the same location.

Example 2: Selecting by player ID pattern

Suppose you ran an experiment without using Prolific, by directly sending an invitation URL to a group of students. You have ensured that all participants' player IDs match a certain pattern, e.g. they start with RU-FDCL-basic-. In this example we also assume that you want to include all players with a matching ID into your analysis, because you have never used that exact prefix for your preview. In this case, you can use the following command line:

 /home/vmenkov/w2020/game/scripts/export-transcripts.sh -pid 'RU-FDCL-basic-%'

(Note the single quotes and the percent sign, rather than asterisk, in the player ID pattern. This is the syntax needed to instruct the MySQL server to carry out "pattern matching", which is similar to wildcard expansion for file names in a UNIX shell, but uses different special characters.)

The rest of the processing is the same as in Example 1.

Example 3: Selecting by player ID pattern, with some exceptions

What if all player IDs that you're interested in all start with the same prefix... but not all player IDs that start with that prefix are relevant? In other words, you want to exclude some player IDs from your analysis, for example because they are associated with your own testing, rather than with recruited players?

If that is the case, you may need to hand-edit the player list. First, prepare the list of all player IDs matching the pattern. In this example, you'll start by putting that list into the file my-pids.csv in your home directory:

      cd /opt/w2020/saved/transcripts
      /bin/ls RU-FDCL-basic-*.transcript.csv > ~/my-pids.csv

After that, go to your home directory and edit the player ID list file, removing the IDs you want to exclude. You can use your favorite text editor (emacs, vi, vim, nano... whatever!)

      cd
      emacs my-pids.csv

(All lines in this file end with the suffix "transcript.csv", but that's OK; the export script will ignore this suffix, and look only at the player ID itself).

After that, you can feed the CSV file to the export transcript tool, much like in Example 1:

 /home/vmenkov/w2020/game/scripts/export-transcripts.sh -pid -file my-pids.csv

The subsequent processing is the same as in Example 1.

Example 4: Combining Prolific summary files and local player ID lists

What if an experiment involved both players recruited from Prolific (listed in Prolific study summary files) as well as locally recruited players (identified by their player ID prefix)? In this case, to analyze all players' data together, you can combine the approaches of Examples 1 and 3.

As in Example 1, download study summary files from Prolific, and upload them to your directory on the Rutgers server.
As in Example 3, prepare a CSV file ( my-pids.csv) with a list of locally-recruited players with appropriate prefix in their player IDs, and put that file to the same directory. If necessary, edit this file to remove any "preview" players.

In the directory where your Prolific summary files and your PID file are, feed both of them to the export script:

      /home/vmenkov/w2020/game/scripts/export-transcripts.sh -pid -file prolific_export_671*.csv  my-pids.csv

The subsequent processing is the same as in Example 1.

Score data

The subdirectory sql contains relevant extracts from the SQL server's tables, as well as 2 files with computed numbers, playerStats.csv and scores.csv:

% ls -l
total 436
-rw-r--r-- 1 vmenkov users 374594 Nov 29 23:47 Episode.csv
-rw-r--r-- 1 vmenkov users  40133 Nov 29 23:47 PlayerInfo.csv
-rw-r--r-- 1 vmenkov users  21470 Nov 29 23:47 playerStats.csv
-rw-r--r-- 1 vmenkov users    743 Nov 29 23:47 scores.csv

playerStats.csv

This file contains some processed information about each player's interaction with the first rule set of his trial list (i.e. the rule set for which he was a "naive player"). The content looks like this:

#playerId,trialListId,seriesNo,rule,maxToRemove,episodes,attempts,removed,couldAlsoAttempt,couldAlsoRemove,finishCode,xFactor
prolific-671b8ae84509b9a624b95e13-5e9088a8a4b6ed018b45ba7b,basic-02-A,0,quadNearby,27,2,12,11,16,16,5,4
prolific-671b8ae84509b9a624b95e13-5e4d01bbca55661437cd235b,basic-03-A,0,quadNearby,27,3,46,27,0,0,1,0
prolific-671b8ae84509b9a624b95e13-5c655e3c42faef0001283e77,basic-03-B,0,quadMixed1,27,3,93,27,0,0,1,0
prolific-671b8ae84509b9a624b95e13-5e64e9aae18d1524c7743b5d,basic-04-A,0,allOfColOrd_BRKY,27,3,37,27,0,0,1,0
prolific-671b8ae84509b9a624b95e13-613018599782583980fbf406,basic-04-B,0,allOfShaOrd_qcts,27,3,36,27,0,0,1,2
...

The columns' values are as follows:

playerId -- the player ID recorded in the Game Server,
trialListId -- the trial list assigned to that player
seriesNo -- the zero-based series number (always 0, since we only look at the first series played by each player)
rule -- the rule set name
maxToRemove -- the total number of objects that the player would need to remove if he clears all boards that the server would present to players for this rule set. For example, if the parameter set calls for 3 boards with 9 objects each, this number would be 27. Note that this number is based on the max_objects column of the trial list file, which means that if your trial list calls for a variable number of object (min_objects != max_objects), or if you used the BONUS incentive scheme, this value won't be correct.
episodes -- the number of episodes this player has actually played. This may be smaller than max_boards if the player gave up or walked away in the middle of an episode, or if he was granted an "early win" due to good performance.
attempts -- the number of attempted moves/picks this player has made
removed -- the total number of objects removed by this player from the boards
couldAlsoAttempt -- if the player ended with an "early win" (without actually clearing all his boards), he would have need to make at least this many moved to actually clear all boards. This is computed as maxToRemove-removed. For all other players, 0.
couldAlsoRemove -- if the player ended with an "early win" (without actually clearing all his boards), this is how many objects he wasn't given an opportunity to remove. This is the same number as the one above. For all other players, 0.
finishCode -- the finish code for the last played episode. (1 for success, 3 for "give up", 5 for "early win")
xFactor -- in games with DOUBLING or LIKELIHOOD incentive schemes, this is 2 or 4 for players who achieved doubling or quadrupling of their scores, 0 for all other players. In all other games, 0.

scores.csv

Aggregate information about the naive player's performance on each rule set.

	sql> more scores.csv
#rule,n,avgAttempts,score,scoreTimesObjectCnt,x
allOfColOrd_BRKY,10,59.1000,2.2074,59.6000,0.30000000
allOfShaOrd_qcts,11,51.1818,1.9327,52.1818,0.50000000
ccw,10,47.8000,1.9333,52.2000,0.45000000
cm_RBKY,9,36.1111,1.4897,40.2222,0.77777778
col1OrdBuck_BRKY0213,12,117.6667,4.3580,117.6667,0.20833333
col1Ord_BRKY,10,54.4000,2.0985,56.6591,0.35000000
colOrdL1_BRKY,10,95.4000,3.6565,98.7252,0.15000000
cw,11,56.1818,2.2803,61.5675,0.50000000
ordL1,22,65.6364,2.6875,72.5625,0.56818182
ordL1_Nearby,10,84.5000,3.2741,88.4000,0.65000000
ordRevOfL1,10,70.8000,2.7000,72.9000,0.55000000
ordRevOfL1_Remotest,11,113.0000,4.4607,120.4393,0.40909091
quadMixed1,10,81.7000,3.1303,84.5172,0E-8
quadNearby,21,52.5238,2.0970,56.6190,0.54761905
sha1OrdBuck_qcts0213,12,97.1667,3.9731,107.2727,0.58333333
sha1Ord_qcts,11,54.0909,2.1246,57.3636,0.40909091
shaOrdL1_qcts,10,80.5000,3.1074,83.9000,0.40000000
sm_csqt,10,47.4000,1.8815,50.8000,0.65000000

The semantics of the columns is as follows:

rule - the rule set name. (With the directory name removed if all rule sets in the experiment come from the same directory)
n - the number of naive players who played that rule set.
avgAttempts - a simplistically computed "difficulty score": the average number of move attempts made by a player on that rule set. If all players cleared all boards shown to them, this would be a very good metric; but since in real life some players give up or walk away, this can be improved a bit.
score - a "corrected score", as described below.
scoreTimesObjectCnt - the "corrected score" multiplied by the expected number of objects that a player would have to remove to clear all boards. If all players have cleared all boards, this number would be exactly the same as avgAttempts.
scoreTimesObjectCnt = score*maxToRemove for this rule set. If all players have cleared all boards, this would have been exactly equal to avgAttempts.
x - the value of xFactor, averaged over all players who played this rule. In games with the DOUBLING or LIKELIHOOD incentive schemes, this value may range from 0 (no player displayed a good level or learning) to 1.0 (all players demonstrated a good level or learning). In all other games, 0.

The corrected score for a rule set is computed as

    score = sum( attempts(P) + couldAlsoAttempt(P) ) / sum( removed(P) + couldAlsoRemove(P)),

with the summation over all players P who played that rule set as the first rule set in their trial list.

The values being summed are defined as

attempts(P) = the number of attempts (succesful or not) that the player P made when playing this rule set
removed(P) = the number of game pieces that the player P made when playing this rule set
couldAlsoRemove(P) = 0 for all players other than "early winners"; for "early winners" (the players who showed mastery), it is maxToRemove-removed(P), where maxToRemove is the total number of objects on all boards shown to a player who playes all episodes to the end.
couldAlsoAttempt(P) = couldAlsoRemove(P)

In other words, for each "early winner" (somebody who played so well that we ended his game) we have notionally "extended" his transcript as if, instead of being stopped, he continued removing all game pieces in all episodes of the series without making any additional errors. (This may be a bit over-optimistic, of course, but not as over-optimistic as simply counting that player's moves without regard to how early he was stopped). For other players, we take their actual transcripts (which may include fewer than maxToRemove (=27 in FDCL/basic)) removed pieces, if they formally gave up or just walked away). Then we compute the weighted average of attemptedMoves/piecesRemoved over all n players.

The value of this "corrected score" would be 1.0 if nobody made any errors, and is proportionally higher as people make more errors. If every player had played all 3 episode to completion, thus removing exactly maxToRemove=27 game pieces, then scoreTimesObjectCnt=score*maxToRemove would be exactly equal to avgAttempts (which, I think, is pretty much the score used by Jacob).