Reshaping data in Python

Time: 2014-07-22 17:03:52

Tags: python csv

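I am currently working with data produced by an EyeLink. The csv (converted from asc) is basically one large sequential list, i.e. no columns are created. For example, one row will contain 'start_trial 1', the next row will contain x and y coordinates, and so on for the following N rows, before reaching 'PreBeep1_1st_Sketchpad' and, eventually, the 'start_trial 2' row.

I was wondering whether anyone has advice on how to manipulate this 'stacked' data and convert it to long format?

Here is what the data looks like when extracted from the csv: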

MSG 12892743 start_trial    1   SCNB    
12892743      757.0   361.7  5916.0 ... SCNB    
MSG 12892744 PreBeep1_1st_Sketchpad SCNB
12892744      756.7   361.7  5920.0 ... SCNB    
12892745      756.1   362.2  5924.0 ... SCNB    
MSG 12892746 order of frames:   SCNB    
12892746      755.8   362.3  5928.0 ... SCNB    
12892747      756.7   362.3  5927.0 ... SCNB    
MSG 12892748 crosshair  SCNB    
12892748      757.8   361.8  5928.0 ... SCNB    
12892749      758.4   361.8  5930.0 ... SCNB    
MSG 12892750 sketchpad  SCNB    
12892750      758.1   361.7  5934.0 ... SCNB    
12892751      758.3   361.7  5938.0 ... SCNB    
MSG 12892752 sketchpad  SCNB    
12892752      759.1   361.9  5948.0 ... SCNB    
12892753      760.4   362.7  5956.0 ... SCNB    
MSG 12892754 sketchpad  SCNB    
12892754      761.7   363.5  5964.0 ... SCNB    
12892755      763.9   364.0  5966.0 ... SCNB    
MSG 12892756 buffer1    SCNB    
12892756      765.6   364.1  5970.0 ... SCNB    
12892757      766.2   364.3  5972.0 ... SCNB    
MSG 12892758 Diode1 SCNB    
12892758      765.2   364.3  5973.0 ... SCNB    
12892759      764.1   364.5  5964.0 ... SCNB    
12892760      763.9   364.7  5955.0 ... SCNB

Ideally, I would like separate columns for:

Trial ID (SCNB shown above)
Frame ID (PreBeep1_1st_Sketchpad above)
X-CoOr (757.0 above)
Y-CoOr (361.7 above)
Time (5916.0 above)

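If it helps, the csv file does contain delimiters.

As can be seen, the data is written sequentially, row by row from top to bottom, rather than organized into columns in the shape I want.

The '...' is also an actual value.

Regarding the column containing the frame ID, e.g. 'start_trial' and 'PreBeep1_1st_Sketchpad': ideally I would like the frame name to be repeated down the column until a new one is encountered.

Any help or advice is much appreciated.

Edit: the output should look like this: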

Trial ID       Frame ID                 X-CoOr    Y-CoOr    Time 
  SCNB           Start_Trial              757.0    361.7    5916.0 
  SCNB           PreBeep1_1st_Sketchpad   756.7    361.7    5920.0
  SCNB           PreBeep1_1st_Sketchpad   756.1    362.2    5924.0

Thanks for taking the time to read.

Edit:

Here is the code I am working with:

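import csv

# 'rb' is the Python 2 convention for csv input; Python 3 needs text mode with newline=''
file2 = open('P1E2E_Both_New_trial_data.csv', 'rb')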
Long_Format = open('P1E2E_Long_Format.csv', 'w')
writer1 = csv.writer(Long_Format, delimiter = '\t')

#First create column headings
columns = ["Trial ID"] + ['Frame ID'] + ['X-CoOr'] + ['Y-CoOr'] + ['Time']
writer1.writerow(columns)

reader1 = csv.reader(file2, delimiter = '\t')

for row in reader1:
    # if statement here to skip blank lines
    if len(row) > 1:
        if 'start_trial' in row[1]:
            label = [row[3]] + ['start_trial']
            writer1.writerow(label)



file2.close()   # <---IMPORTANT
Long_Format.close()

The output of the above is:

Trial ID      Frame ID      X-CoOr     Y-CoOr     Time

SCNB          start_trial

RCL           start_trial

SCR           start_trial

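...... and so on.

My problem is that I don't know where to go from here. Even if it worked, my approach would be very inefficient. I don't know how to tell Python to keep reading the rows that follow the 'Start_Trial' label inside the if statement, and to write the x and y coordinate values from row[2] and row[3] into the corresponding columns after that label. Does that make sense?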

2 answers:

Answer 0 (score: 1)

If we assume that all rows use the same delimiter, this problem is not as bad as it looks.
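
That assumption is cheap to check before reshaping anything. The sketch below is only an illustration: the sample rows are copied from the question's excerpt, but their exact split into tab-separated fields is a guess, and `field_count_profile` is a made-up helper name. It tallies how many fields each kind of row has, so a row that the delimiter fails to split shows up as a one-field row.

```python
import csv
import io
from collections import Counter

# ASSUMPTION: the real file splits into fields like this; adjust to your data.
SAMPLE = (
    "MSG\t12892743\tstart_trial\t1\tSCNB\n"
    "12892743\t757.0\t361.7\t5916.0\t...\tSCNB\n"
    "MSG\t12892744\tPreBeep1_1st_Sketchpad\tSCNB\n"
    "12892744\t756.7\t361.7\t5920.0\t...\tSCNB\n"
    "12892745\t756.1\t362.2\t5924.0\t...\tSCNB\n"
)

def field_count_profile(lines, delimiter="\t"):
    """Count rows by (row kind, number of fields); hypothetical helper."""
    profile = Counter()
    for row in csv.reader(lines, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        kind = "MSG" if row[0] == "MSG" else "data"
        profile[(kind, len(row))] += 1
    return profile

profile = field_count_profile(io.StringIO(SAMPLE))
for (kind, n_fields), count in sorted(profile.items()):
    print(kind, n_fields, count)
```

If the data rows all share one field count, fixed column indices are safe for them. MSG rows may legitimately differ (in the excerpt the 'start_trial' row carries an extra '1'), which matters when choosing the index of the trial ID.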

The key is to realize that all of the frame rows start with the key 'MSG':

import csv
# Header values
FRAME_KEY = 'MSG'
FRAME_IDX = 0
TRIAL_ID_KEY = 'Trial ID'
TRIAL_ID_IDX = 3
FRAME_ID_KEY = 'Frame ID'
FRAME_ID_IDX = 2
# Data values
XCOR_KEY     = 'X-CoOr'
XCOR_IDX     = 1
YCOR_KEY     = 'Y-CoOr'
YCOR_IDX     = 2
TIME_KEY     = 'Time'
TIME_IDX     = 3

IN_DELIM = '\t'
OUT_DELIM= '\t'

OUT_HEADER = [TRIAL_ID_KEY, FRAME_ID_KEY, XCOR_KEY, YCOR_KEY, TIME_KEY]

# The output file must be opened for writing, and the keyword is spelled 'delimiter'
with open('P1E2E_Both_New_trial_data.csv', 'rb') as in_file, open('P1E2E_Long_Format.csv', 'w') as out_file:
    in_reader = csv.reader(in_file, delimiter = IN_DELIM)
    out_writer= csv.DictWriter(out_file, OUT_HEADER, delimiter = OUT_DELIM)
    out_writer.writeheader()
    current_frame = None
    current_trial = None
    for row in in_reader:
        if row[FRAME_IDX] == FRAME_KEY:
            # Means we're at the start of a new frame
            current_frame = row[FRAME_ID_IDX]
            current_trial = row[TRIAL_ID_IDX]
        else:
            # Means we're in a data row
            out_row = dict()
            out_row[FRAME_ID_KEY] = current_frame
            out_row[TRIAL_ID_KEY] = current_trial
            out_row[XCOR_KEY]     = row[XCOR_IDX]
            out_row[YCOR_KEY]     = row[YCOR_IDX]
            out_row[TIME_KEY]     = row[TIME_IDX]
            out_writer.writerow(out_row)

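Basically, when you hit a row with the 'MSG' key, you know you are starting a new frame. Otherwise you write out the data. DictWriter lets you do this easily without worrying about column order (the order is defined by OUT_HEADER).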
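
For anyone who wants to try the idea without the original file, here is a minimal Python 3 sketch of the same "remember the last MSG label" loop, run on an in-memory sample. The field layout is an assumption (MSG rows: 'MSG', timestamp, label, ..., trial ID; data rows: timestamp, x, y, time, '...', trial ID), and `to_long_format` is a hypothetical name. It takes the trial ID from the last field of each data row, which sidesteps the extra '1' on the 'start_trial' row.

```python
import csv
import io

# ASSUMPTION: tab-separated fields laid out as described above.
SAMPLE = (
    "MSG\t12892743\tstart_trial\t1\tSCNB\n"
    "12892743\t757.0\t361.7\t5916.0\t...\tSCNB\n"
    "MSG\t12892744\tPreBeep1_1st_Sketchpad\tSCNB\n"
    "12892744\t756.7\t361.7\t5920.0\t...\tSCNB\n"
    "12892745\t756.1\t362.2\t5924.0\t...\tSCNB\n"
)

HEADER = ["Trial ID", "Frame ID", "X-CoOr", "Y-CoOr", "Time"]

def to_long_format(lines, delimiter="\t"):
    """Yield one dict per data row, carrying the latest frame label forward."""
    current_frame = None
    for row in csv.reader(lines, delimiter=delimiter):
        row = [field.strip() for field in row if field.strip()]
        if not row:
            continue  # skip blank lines
        if row[0] == "MSG":
            current_frame = row[2]  # new frame starts here
            continue
        yield {
            "Trial ID": row[-1],  # last field of the data row
            "Frame ID": current_frame,
            "X-CoOr": row[1],
            "Y-CoOr": row[2],
            "Time": row[3],
        }

rows = list(to_long_format(io.StringIO(SAMPLE)))

# Write to an in-memory buffer; swap in open(path, "w", newline="") for a real file.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=HEADER, delimiter="\t")
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Because the label is just a variable that outlives each loop iteration, every data row gets the most recent frame name until the next 'MSG' row replaces it.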

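Answer 1 (score: 0)

I have adapted the answer submitted by @aruisdante, because the original code did not record every instance of a frame ID. I noticed this when counting the start_trial frame IDs: they did not add up to the known total.

Here is the revised code: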

import csv

FRAME_KEY = 'MSG'
FRAME_IDX = 0
FRAME_ID_KEY = 'Frame ID'
FRAME_ID_IDX = 1
TRIAL_ID_KEY = 'Trial ID'
TRIAL_ID_IDX = 2
# Data values
XCOR_KEY     = 'X-CoOr'
XCOR_IDX     = 1
YCOR_KEY     = 'Y-CoOr'
YCOR_IDX     = 2
TIME_KEY     = 'Time'
TIME_IDX     = 3

IN_DELIM = '\t'
OUT_DELIM= '\t'

OUT_HEADER = [TRIAL_ID_KEY, FRAME_ID_KEY, XCOR_KEY, YCOR_KEY, TIME_KEY]

currentframecount = 0
currentframecount1 = 0
out_row = dict()


with open('P1E2E_Both_New_trial_data.csv', 'rb') as in_file, open('P1E2E_Long_Format.csv', 'w') as out_file:
    in_reader = csv.reader(in_file, delimiter = IN_DELIM)
    out_writer= csv.DictWriter(out_file, OUT_HEADER, delimiter = OUT_DELIM)
    out_writer.writeheader()
    current_frame = None
    current_trial = None

    for row in in_reader:
        if row[FRAME_IDX] == FRAME_KEY:
            # Means we're at the start of a new frame
            current_frame = row[FRAME_ID_IDX]
            current_trial = row[TRIAL_ID_IDX]

            #out_row[TRIAL_ID_KEY] = current_trial
            #out_row[FRAME_ID_KEY] = current_frame
            #out_writer.writerow(out_row)
            #if 'start_trial' in current_frame:
            #   currentframecount += 1
            #  print currentframecount
            # Here ensures that 'start_trial' labels are recorded
            # (out_row is reused, so its other columns keep the previous row's values)
            if 'start_trial' in row[FRAME_ID_IDX]:
                out_row[FRAME_ID_KEY] = row[FRAME_ID_IDX]
                out_writer.writerow(out_row)

        else:
            # Means we're in a data row
            # Write everything except 'start_trial' to ensure no repetition of this particular label
            if 'start_trial' not in current_frame:
                out_row[FRAME_ID_KEY] = current_frame # think this is pulling value from last if statement on current_frame

                out_row[TRIAL_ID_KEY] = current_trial
                out_row[XCOR_KEY]     = row[XCOR_IDX]
                out_row[YCOR_KEY]     = row[YCOR_IDX]
                out_row[TIME_KEY]     = row[TIME_IDX]
                out_writer.writerow(out_row)
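
The count check that exposed the problem can be done directly on the input, independent of the reshaping code. The following is a rough Python 3 sketch: the sample rows are made up for illustration (two trials), the label position `label_idx=2` is an assumption about the field layout, and `count_frame_labels` and `expected_trials` are hypothetical names.

```python
import csv
import io
from collections import Counter

# Made-up sample with two trials; ASSUMPTION: the label is the third field of MSG rows.
SAMPLE = (
    "MSG\t12892743\tstart_trial\t1\tSCNB\n"
    "12892743\t757.0\t361.7\t5916.0\t...\tSCNB\n"
    "MSG\t12892744\tPreBeep1_1st_Sketchpad\tSCNB\n"
    "12892744\t756.7\t361.7\t5920.0\t...\tSCNB\n"
    "MSG\t12892746\tstart_trial\t2\tSCNB\n"
    "12892746\t755.8\t362.3\t5928.0\t...\tSCNB\n"
)

def count_frame_labels(lines, delimiter="\t", label_idx=2):
    """Count how often each frame label appears on MSG rows."""
    counts = Counter()
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) > label_idx and row[0] == "MSG":
            counts[row[label_idx].strip()] += 1
    return counts

counts = count_frame_labels(io.StringIO(SAMPLE))
expected_trials = 2  # hypothetical known total for this sample
if counts["start_trial"] != expected_trials:
    print("missing start_trial markers:", expected_trials - counts["start_trial"])
else:
    print("all start_trial markers found")
```

Comparing this tally with the number of 'start_trial' rows in the output file shows whether any frames were dropped during conversion.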