
Shaping Data in Python

Question:

I'm currently working with data generated by EyeLink. The csv (converted from asc) is basically one long sequential list, i.e. the values are not organised into columns. For example, one row will contain 'start_trial 1', the following row will contain x and y coordinates, and so will the next N rows, until a 'PreBeep1_1st_Sketchpad' row appears and, eventually, a 'start_trial 2' row.

I was wondering if anyone has any advice on how to manipulate this 'stacked' data and transform it into long form data?

Here is what the data looks like when pulled from the csv:

    MSG 12892743 start_trial 1 SCNB
    12892743 757.0 361.7 5916.0 ... SCNB
    MSG 12892744 PreBeep1_1st_Sketchpad SCNB
    12892744 756.7 361.7 5920.0 ... SCNB
    12892745 756.1 362.2 5924.0 ... SCNB
    MSG 12892746 order of frames: SCNB
    12892746 755.8 362.3 5928.0 ... SCNB
    12892747 756.7 362.3 5927.0 ... SCNB
    MSG 12892748 crosshair SCNB
    12892748 757.8 361.8 5928.0 ... SCNB
    12892749 758.4 361.8 5930.0 ... SCNB
    MSG 12892750 sketchpad SCNB
    12892750 758.1 361.7 5934.0 ... SCNB
    12892751 758.3 361.7 5938.0 ... SCNB
    MSG 12892752 sketchpad SCNB
    12892752 759.1 361.9 5948.0 ... SCNB
    12892753 760.4 362.7 5956.0 ... SCNB
    MSG 12892754 sketchpad SCNB
    12892754 761.7 363.5 5964.0 ... SCNB
    12892755 763.9 364.0 5966.0 ... SCNB
    MSG 12892756 buffer1 SCNB
    12892756 765.6 364.1 5970.0 ... SCNB
    12892757 766.2 364.3 5972.0 ... SCNB
    MSG 12892758 Diode1 SCNB
    12892758 765.2 364.3 5973.0 ... SCNB
    12892759 764.1 364.5 5964.0 ... SCNB
    12892760 763.9 364.7 5955.0 ... SCNB

Ideally I'd like to have individual columns for:

  • Trial ID (SCNB shown above)
  • Frame ID (PreBeep1_1st_Sketchpad above)
  • X-CoOr (757.0 above)
  • Y-CoOr (361.7 above)
  • Time (5916.0 above)

Delimiters are \t in the csv file if that helps.

As can be seen, the data is written row after row, from top to bottom, instead of being organised into the columns I want.

The '...' entries are actual values as well.

Regarding the column that will contain Frame IDs such as 'start_trial' and 'PreBeep1_1st_Sketchpad', I would ideally want the name of each frame repeated down the column until a new one is encountered.

Any help or advice would be greatly appreciated.

EDIT: Output should look like this:

    Trial ID    Frame ID                  X-CoOr    Y-CoOr    Time
    SCNB        Start_Trial               757.0     361.7     5916.0
    SCNB        PreBeep1_1st_Sketchpad    756.7     361.7     5920.0
    SCNB        PreBeep1_1st_Sketchpad    756.1     362.2     5924.0

Thanks for taking the time to read.

EDIT:

Here is the code I was working with:

    import csv

    # newline='' is what the csv module expects for files in Python 3
    file2 = open('P1E2E_Both_New_trial_data.csv', 'r', newline='')
    Long_Format = open('P1E2E_Long_Format.csv', 'w', newline='')
    writer1 = csv.writer(Long_Format, delimiter='\t')

    # First create column headings
    columns = ['Trial ID', 'Frame ID', 'X-CoOr', 'Y-CoOr', 'Time']
    writer1.writerow(columns)

    reader1 = csv.reader(file2, delimiter='\t')
    for row in reader1:
        # Skip blank lines
        if len(row) > 1:
            if 'start_trial' in row[1]:
                label = [row[3]] + ['start_trial']
                writer1.writerow(label)

    file2.close()  # <---IMPORTANT
    Long_Format.close()

The output for the above is:

    Trial ID    Frame ID       X-CoOr    Y-CoOr    Time
    SCNB        start_trial
    RCL         start_trial
    SCR         start_trial

... and so on.

My problem is that I don't know where to go from here. My approach would be terribly inefficient even if it were to work. I don't know how to tell Python to continue reading the lines after the 'start_trial' label in the if statement, and to write the x and y CoOr values from row[2] and row[3] into the appropriate columns after that label. Does that make sense?

Answer1:

If we assume that all lines use the same delimiter, this problem isn't as bad as it looks.

The key is realizing that all of the frame lines start with the key 'MSG':

    import csv

    # Header values
    FRAME_KEY = 'MSG'
    FRAME_IDX = 0
    TRIAL_ID_KEY = 'Trial ID'
    TRIAL_ID_IDX = 3
    FRAME_ID_KEY = 'Frame ID'
    FRAME_ID_IDX = 2

    # Data values
    XCOR_KEY = 'X-CoOr'
    XCOR_IDX = 1
    YCOR_KEY = 'Y-CoOr'
    YCOR_IDX = 2
    TIME_KEY = 'Time'
    TIME_IDX = 3

    IN_DELIM = '\t'
    OUT_DELIM = '\t'
    OUT_HEADER = [TRIAL_ID_KEY, FRAME_ID_KEY, XCOR_KEY, YCOR_KEY, TIME_KEY]

    # The output file must be opened for writing ('w');
    # newline='' is what the csv module expects in Python 3
    with open('P1E2E_Both_New_trial_data.csv', 'r', newline='') as in_file, \
         open('P1E2E_Long_Format.csv', 'w', newline='') as out_file:
        in_reader = csv.reader(in_file, delimiter=IN_DELIM)
        out_writer = csv.DictWriter(out_file, OUT_HEADER, delimiter=OUT_DELIM)
        out_writer.writeheader()

        current_frame = None
        current_trial = None
        for row in in_reader:
            if not row:
                # Skip blank lines
                continue
            if row[FRAME_IDX] == FRAME_KEY:
                # Means we're at the start of a new frame
                current_frame = row[FRAME_ID_IDX]
                current_trial = row[TRIAL_ID_IDX]
            else:
                # Means we're in a data row
                out_row = dict()
                out_row[FRAME_ID_KEY] = current_frame
                out_row[TRIAL_ID_KEY] = current_trial
                out_row[XCOR_KEY] = row[XCOR_IDX]
                out_row[YCOR_KEY] = row[YCOR_IDX]
                out_row[TIME_KEY] = row[TIME_IDX]
                out_writer.writerow(out_row)

Basically, when you hit a row with the 'MSG' key, you know you're starting a new frame, so you remember its frame and trial IDs. Otherwise, you write out the data row together with the remembered IDs. DictWriter makes this easy because you fill in each row by column name and don't have to worry about order (the order is defined by OUT_HEADER).
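To make the idea easier to try out, here is a minimal sketch of the same "remember the last MSG label" loop running on a few in-memory rows. The column layout is an assumption, not something confirmed by the question: the sample assumes the frame label sits at index 2 of an MSG row and the trial ID is the last field of every data row, and it drops the '...' columns. The function name `iter_long_rows` and its `frame_idx` and `trial_idx` parameters are hypothetical; adjust the indices to match the real file.

```python
import csv
import io

HEADER = ["Trial ID", "Frame ID", "X-CoOr", "Y-CoOr", "Time"]


def iter_long_rows(lines, delimiter="\t", frame_idx=2, trial_idx=-1):
    """Yield one dict per data row, tagged with the most recent MSG label.

    frame_idx and trial_idx are assumptions about the row layout.
    """
    current_frame = None
    for row in csv.reader(lines, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        if row[0] == "MSG":
            # Label row: remember the frame name for the rows that follow
            current_frame = row[frame_idx]
        else:
            # Data row (assumed layout): timestamp, x, y, time, ..., trial ID.
            # Keys are deliberately out of order; DictWriter sorts them out.
            yield {
                "Time": row[3],
                "X-CoOr": row[1],
                "Y-CoOr": row[2],
                "Frame ID": current_frame,
                "Trial ID": row[trial_idx],
            }


# A few rows modelled on the sample in the question (assumed tab layout)
stacked = io.StringIO(
    "MSG\t12892743\tstart_trial 1\tSCNB\n"
    "12892743\t757.0\t361.7\t5916.0\tSCNB\n"
    "\n"
    "MSG\t12892744\tPreBeep1_1st_Sketchpad\tSCNB\n"
    "12892744\t756.7\t361.7\t5920.0\tSCNB\n"
    "12892745\t756.1\t362.2\t5924.0\tSCNB\n"
)

out = io.StringIO()
writer = csv.DictWriter(out, HEADER, delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(iter_long_rows(stacked))
print(out.getvalue())
```

Keeping the parsing in a generator means the same loop can be pointed at an open file instead of the `io.StringIO` sample, and the column order in the output comes only from `HEADER`.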

Answer2:

I've adapted the answer submitted by @aruisdante because the original code did not record every instance of the Frame IDs. I noticed this when a count of the start_trial frame IDs fell short of the known total.

Here is the amended code:

    import csv

    FRAME_KEY = 'MSG'
    FRAME_IDX = 0
    FRAME_ID_KEY = 'Frame ID'
    FRAME_ID_IDX = 1
    TRIAL_ID_KEY = 'Trial ID'
    TRIAL_ID_IDX = 2

    # Data values
    XCOR_KEY = 'X-CoOr'
    XCOR_IDX = 1
    YCOR_KEY = 'Y-CoOr'
    YCOR_IDX = 2
    TIME_KEY = 'Time'
    TIME_IDX = 3

    IN_DELIM = '\t'
    OUT_DELIM = '\t'
    OUT_HEADER = [TRIAL_ID_KEY, FRAME_ID_KEY, XCOR_KEY, YCOR_KEY, TIME_KEY]

    # newline='' is what the csv module expects in Python 3
    with open('P1E2E_Both_New_trial_data.csv', 'r', newline='') as in_file, \
         open('P1E2E_Long_Format.csv', 'w', newline='') as out_file:
        in_reader = csv.reader(in_file, delimiter=IN_DELIM)
        out_writer = csv.DictWriter(out_file, OUT_HEADER, delimiter=OUT_DELIM)
        out_writer.writeheader()

        current_frame = None
        current_trial = None
        for row in in_reader:
            if not row:
                # Skip blank lines
                continue
            if row[FRAME_IDX] == FRAME_KEY:
                # Means we're at the start of a new frame
                current_frame = row[FRAME_ID_IDX]
                current_trial = row[TRIAL_ID_IDX]
                # Ensure that 'start_trial' labels are recorded, on a row of
                # their own (a fresh dict avoids carrying over stale X/Y/Time)
                if 'start_trial' in current_frame:
                    out_writer.writerow({TRIAL_ID_KEY: current_trial,
                                         FRAME_ID_KEY: current_frame})
            else:
                # Means we're in a data row.
                # Write everything except rows under 'start_trial', so that
                # this particular label is not repeated
                if 'start_trial' not in current_frame:
                    out_row = dict()
                    out_row[FRAME_ID_KEY] = current_frame
                    out_row[TRIAL_ID_KEY] = current_trial
                    out_row[XCOR_KEY] = row[XCOR_IDX]
                    out_row[YCOR_KEY] = row[YCOR_IDX]
                    out_row[TIME_KEY] = row[TIME_IDX]
                    out_writer.writerow(out_row)
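The sanity check mentioned in this answer (counting the start_trial Frame IDs and comparing against the known number of trials) could look something like the sketch below. It is only an illustration: the helper name `count_frame_labels` is hypothetical, and the in-memory sample stands in for the real P1E2E_Long_Format.csv, with made-up rows that merely follow the column headings used above.

```python
import csv
import io
from collections import Counter


def count_frame_labels(lines, delimiter="\t", frame_column="Frame ID"):
    """Count how many output rows carry each Frame ID label."""
    counts = Counter()
    for row in csv.DictReader(lines, delimiter=delimiter):
        counts[row[frame_column]] += 1
    return counts


# Hypothetical long-format output; replace with open('P1E2E_Long_Format.csv', newline='')
long_format = io.StringIO(
    "Trial ID\tFrame ID\tX-CoOr\tY-CoOr\tTime\n"
    "SCNB\tstart_trial 1\t\t\t\n"
    "SCNB\tPreBeep1_1st_Sketchpad\t756.7\t361.7\t5920.0\n"
    "SCNB\tPreBeep1_1st_Sketchpad\t756.1\t362.2\t5924.0\n"
    "RCL\tstart_trial 2\t\t\t\n"
)

counts = count_frame_labels(long_format)
# Labels look like 'start_trial 1', 'start_trial 2', ... so match on the prefix
n_start = sum(n for label, n in counts.items() if "start_trial" in label)
print("start_trial rows:", n_start)
```

If `n_start` does not match the number of trials you know were run, some label rows are being dropped or duplicated, and the column indices are the first thing to check.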
