The Export Transcripts tool

Updated 2024-12-02, for Game Server 6.046

Introduction

The Rule Game server and the suite of tools that come with it offer a variety of tools to access and process the accumulated game data. Depending on one's needs and inclinations, one can simply use the data as they are, by reading the transcript CSV files and sending queries to the SQL database; one can copy the entire data set (CSV files and SQL tables) to your own computer; one can divide transcript files into sections associated with different rules sets, and add to them informaiton about the statistical difficulty of correct moves.

This document describes one of the simplest tools, the script export-transcripts.sh. Its primary purpose is to select the transcript files from the participants who satisfy your criteria (e.g. "everybody listed in study summary files on Prolific"; "everybody who participated in experiment plans FDCL/basic, and whose player ID started with prolific- or RU-", etc). Additionally, it will also extract the selected players' rows from the tables in the MySQL database, and will export them in CSV format. All files so selected or produced will be put into a new directory, so that you can ZIP it and then transfer the ZIP file to another computer for further analysis.

Use case examples

The script export-transcripts.sh takes most of the same command line arguments as does analyze-transcripts.sh, allowing selecting records based on experiment plan name, player IDs, user ID (for repeat players), etc. This section discusses a few examples that are likely to occur in your practice.

Example 1: Using Prolific summary files

If you run an "experiment" on Prolific, Prolific stores the list of participants for each "study" included in the experiment. You can download that list by going to the study's detail page, opening the "Actions" pop-up menu, choosing "Download summary", and saving the CSV file offered to you.

The download summary file is in CSV format with many columns; it looks like this:

% more prolific_export_671b8ae84509b9a624b95e13.csv
  Submission id,Participant id,Status,Custom study tncs accepted at,...
671b8e95b17f88050648883e,5e9088a8a4b6ed018b450000,APPROVED,Not Applicable,....
671b8e98fd64b1b65fa1a726,5c655e3c42faef0001280000,APPROVED,Not Applicable,....
....
  
The name of the file contains the Prolific study ID, while the second column of the file contains the participants' Prolific PID. (Note that it only includes the real Prolific participants, to whom you pay via the Prolific platform; it does not have an entry for yourself doing "preview as participant" of the study).

When the Prolific participants play our games, their Prolific PIDs are converted to our Game Server player IDs, which combine the study ID and the Prolific PID, e.g. prolific-671b8ae84509b9a624b95e13-5e9088a8a4b6ed018b450000.

Once you have downloaded the study summary files for your experiment to your PC, you need to upload them to the Game Server host (action.rutgers.edu) using sftp or scp. Put them into the same directory under your home directory, e.g. fdcl-01.

Log in to the Game Server host using ssh, go to the directory where you have put the summary files, and run the export transcripts script:

      cd fdcl-01
      /home/vmenkov/w2020/game/scripts/export-transcripts.sh -pid -file prolific_export_671*.csv 

If the script runs successfully, you will find that it has created a directory named tmp, with the following content:

    % ls -l tmp
total 8
drwxr-xr-x  75 vmenkov  staff  2400 Nov  6 19:54 detailed-transcripts/
-rw-r--r--   1 vmenkov  staff   928 Nov  9 15:22 pid.csv
drwxr-xr-x   4 vmenkov  staff   128 Nov  9 15:22 sql/
drwxr-xr-x  75 vmenkov  staff  2400 Nov  6 19:54 transcripts/
    

Here, the subdirectories transcripts and detailed-transcripts contain copies of the relevant CSV files from Game Server's saved-files directories of the same names, with any duplicate episode records removed. The file pid.csv contains the list of the Game Server player IDs, and the subdirectory sql contains relevant extracts from the SQL server's tables as well as the scoring date (see the section "Score data" below)

You can now make a ZIP file containing the entire content of the directory tmp:

      zip -r tmp.zip tmp
      
and then transfer tmp.zip to your personal computer, or any another host, for further analysis. After that, you can remove the directory tmp the Rutgers server, or rename it somehow, so that its content would not be confusing for any future run of the export transcript tool at the same location.

Example 2: Selecting by player ID pattern

Suppose you ran an experiment without using Prolific, by directly sending an invitation URL to a group of students. You have ensured that all participants' player IDs match a certain pattern, e.g. they start with RU-FDCL-basic-. In this example we also assume that you want to include all players with a matching ID into your analysis, because you have never used that exact prefix for your preview. In this case, you can use the following command line:

 /home/vmenkov/w2020/game/scripts/export-transcripts.sh -pid 'RU-FDCL-basic-%'
    
(Note the single quotes and the percent sign, rather than asterisk, in the player ID pattern. This is the syntax needed to instruct the MySQL server to carry out "pattern matching", which is similar to wildcard expansion for file names in a UNIX shell, but uses different special characters.)

The rest of the processing is the same as in Example 1.

Example 3: Selecting by player ID pattern, with some exceptions

What if all player IDs that you're interested in all start with the same prefix... but not all player IDs that start with that prefix are relevant? In other words, you want to exclude some player IDs from your analysis, for example because they are associated with your own testing, rather than with recruited players?

If that is the case, you may need to hand-edit the player list. First, prepare the list of all player IDs matching the pattern. In this example, you'll start by putting that list into the file my-pids.csv in your home directory:

      cd /opt/w2020/saved/transcripts
      /bin/ls RU-FDCL-basic-*.transcript.csv > ~/my-pids.csv
    

After that, go to your home directory and edit the player ID list file, removing the IDs you want to exclude. You can use your favorite text editor (emacs, vi, vim, nano... whatever!)

      cd
      emacs my-pids.csv
    
(All lines in this file end with the suffix "transcript.csv", but that's OK; the export script will ignore this suffix, and look only at the player ID itself).

After that, you can feed the CSV file to the export transcript tool, much like in Example 1:

 /home/vmenkov/w2020/game/scripts/export-transcripts.sh -pid -file my-pids.csv
      

The subsequent processing is the same as in Example 1.

Example 4: Combining Prolific summary files and local player ID lists

What if an experiment involved both players recruited from Prolific (listed in Prolific study summary files) as well as locally recruited players (identified by their player ID prefix)? In this case, to analyze all players' data together, you can combine the approaches of Examples 1 and 3.

  1. As in Example 1, download study summary files from Prolific, and upload them to your directory on the Rutgers server.
  2. As in Example 3, prepare a CSV file ( my-pids.csv) with a list of locally-recruited players with appropriate prefix in their player IDs, and put that file to the same directory. If necessary, edit this file to remove any "preview" players.
  3. In the directory where your Prolific summary files and your PID file are, feed both of them to the export script:
          /home/vmenkov/w2020/game/scripts/export-transcripts.sh -pid -file prolific_export_671*.csv  my-pids.csv
          
  4. The subsequent processing is the same as in Example 1.

Score data

The subdirectory sql contains relevant extracts from the SQL server's tables, as well as 2 files with computed numbers, playerStats.csv and scores.csv:

% ls -l
total 436
-rw-r--r-- 1 vmenkov users 374594 Nov 29 23:47 Episode.csv
-rw-r--r-- 1 vmenkov users  40133 Nov 29 23:47 PlayerInfo.csv
-rw-r--r-- 1 vmenkov users  21470 Nov 29 23:47 playerStats.csv
-rw-r--r-- 1 vmenkov users    743 Nov 29 23:47 scores.csv
    

playerStats.csv

This file contains some processed information about each player's interaction with the first rule set of his trial list (i.e. the rule set for which he was a "naive player"). The content looks like this:

#playerId,trialListId,seriesNo,rule,maxToRemove,episodes,attempts,removed,couldAlsoAttempt,couldAlsoRemove,finishCode,xFactor
prolific-671b8ae84509b9a624b95e13-5e9088a8a4b6ed018b45ba7b,basic-02-A,0,quadNearby,27,2,12,11,16,16,5,4
prolific-671b8ae84509b9a624b95e13-5e4d01bbca55661437cd235b,basic-03-A,0,quadNearby,27,3,46,27,0,0,1,0
prolific-671b8ae84509b9a624b95e13-5c655e3c42faef0001283e77,basic-03-B,0,quadMixed1,27,3,93,27,0,0,1,0
prolific-671b8ae84509b9a624b95e13-5e64e9aae18d1524c7743b5d,basic-04-A,0,allOfColOrd_BRKY,27,3,37,27,0,0,1,0
prolific-671b8ae84509b9a624b95e13-613018599782583980fbf406,basic-04-B,0,allOfShaOrd_qcts,27,3,36,27,0,0,1,2
...
      

The columns' values are as follows:

scores.csv

Aggregate information about the naive player's performance on each rule set.

	sql> more scores.csv
#rule,n,avgAttempts,score,scoreTimesObjectCnt,x
allOfColOrd_BRKY,10,59.1000,2.2074,59.6000,0.30000000
allOfShaOrd_qcts,11,51.1818,1.9327,52.1818,0.50000000
ccw,10,47.8000,1.9333,52.2000,0.45000000
cm_RBKY,9,36.1111,1.4897,40.2222,0.77777778
col1OrdBuck_BRKY0213,12,117.6667,4.3580,117.6667,0.20833333
col1Ord_BRKY,10,54.4000,2.0985,56.6591,0.35000000
colOrdL1_BRKY,10,95.4000,3.6565,98.7252,0.15000000
cw,11,56.1818,2.2803,61.5675,0.50000000
ordL1,22,65.6364,2.6875,72.5625,0.56818182
ordL1_Nearby,10,84.5000,3.2741,88.4000,0.65000000
ordRevOfL1,10,70.8000,2.7000,72.9000,0.55000000
ordRevOfL1_Remotest,11,113.0000,4.4607,120.4393,0.40909091
quadMixed1,10,81.7000,3.1303,84.5172,0E-8
quadNearby,21,52.5238,2.0970,56.6190,0.54761905
sha1OrdBuck_qcts0213,12,97.1667,3.9731,107.2727,0.58333333
sha1Ord_qcts,11,54.0909,2.1246,57.3636,0.40909091
shaOrdL1_qcts,10,80.5000,3.1074,83.9000,0.40000000
sm_csqt,10,47.4000,1.8815,50.8000,0.65000000
	

The semantics of the columns is as follows:

The corrected score for a rule set is computed as

    score = sum( attempts(P) + couldAlsoAttempt(P) ) / sum( removed(P) + couldAlsoRemove(P)),
  
with the summation over all players P who played that rule set as the first rule set in their trial list.

The values being summed are defined as

In other words, for each "early winner" (somebody who played so well that we ended his game) we have notionally "extended" his transcript as if, instead of being stopped, he continued removing all game pieces in all episodes of the series without making any additional errors. (This may be a bit over-optimistic, of course, but not as over-optimistic as simply counting that player's moves without regard to how early he was stopped). For other players, we take their actual transcripts (which may include fewer than maxToRemove (=27 in FDCL/basic)) removed pieces, if they formally gave up or just walked away). Then we compute the weighted average of attemptedMoves/piecesRemoved over all n players.

The value of this "corrected score" would be 1.0 if nobody made any errors, and is proportionally higher as people make more errors. If every player had played all 3 episode to completion, thus removing exactly maxToRemove=27 game pieces, then scoreTimesObjectCnt=score*maxToRemove would be exactly equal to avgAttempts (which, I think, is pretty much the score used by Jacob).