Pruning tab-delimited files with Python

Time: 2015-05-08 06:48:19

Tags: python csv

So I have a bunch of tab-delimited data files that look like this:

Subject Phase   Condition   Trial   Trial Type  Target Loc  TargetID    DistID  Digit1  Digit2  Accuracy-T  RT-P    RT-T
2   1   9   1   cong    bottom  S H I F T   S H I F T   7   2   1   742.69104   681.4379692
2   1   9   2   cong    top P A S T E   P A S T E   2   3   1   699.4130611 454.8609257
2   1   9   3   incong  top S U G A R   Y O U T H   6   5   1   979.2759418 31.06093407
2   1   9   4   incong  top C H E E K   G R O A N   4   8   1   1025.339842 31.55088425
2   1   9   5   incong  bottom  S T A L K   L E A V E   7   9   1   555.9248924 479.6338081
2   1   9   6   incong  top B R A I N   F I E L D   4   5   2   976.7041206 31.50486946
2   1   9   7   incong  bottom  C R O W N   P L A T E   5   7   1   0   32.24992752
2   1   9   8   cong    top S T A N D   S T A N D   7   6   1   1092.888117 31.59618378
2   1   9   9   cong    bottom  R O U T E   R O U T E   4   8   1   883.2840919 31.32796288
2   1   9   10  cong    top F L O A T   F L O A T   5   6   1   768.682003

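What I want to do is remove from the file any row that has a value of '2' or '3' under the 'Accuracy-T' header (sorry, the headers are misaligned; it's the 10th value).

So the basic idea is a Python script that iterates this function over multiple files (seen here as 'studyfile') and spits out a new tab-delimited text file with those items removed (seen here as 'goodstudyfile'). So I came up with this: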

GroupVar=['1','2']
SubjectVar=['1','2']
CondVar=['1','2','3','4','5','6','7','8','9','10','11','12']

for group in GroupVar:
    for subject in SubjectVar: 
        for condition in CondVar:
            studyfile_name = '*/Pruning/Study 126/Group_'+str(group)+'_Subject_'+str(subject)+'_Condition_'+str(condition)+'_phase_1.txt'
            studyfile = open(studyfile_name,'r')

            goodstudyfile_name = '*/Pruning/Study 126/Phase 1 No Errors/Group_'+str(group)+'_Subject_'+str(subject)+'_Condition_'+str(condition)+'_phase_1_Fixed.txt'
            goodstudyfile = open(goodstudyfile_name,'w')

            study_lines = studyfile.readlines()

            studyfile.close()

            first_block = study_lines[4].split('\t')[1].strip()

            NR_errors_removed = 0
            R_errors_removed = 0
            spoils_removed = 0
            low_cutoff_spoils = 0
            for study_line in study_lines:
                if len(study_line.split('\t')) > 2:
                    if study_line.split('\t')[10] == '2':
                        if study_line.split('\t')[4] == 'incong':
                            study_lines.remove(study_line)
                            NR_errors_removed+=1
                        elif study_line.split('\t')[4] == 'cong':
                            study_lines.remove(study_line)
                            R_errors_removed+=1                                   
                    elif study_line.split('\t')[10] == '3':
                            study_lines.remove(study_line)
                            spoils_removed+=1
                    else:
                        for study_line in study_lines[1:]:                        
                            if int(float(study_line.split('\t')[12][:8])) < 100.00:
                                study_lines.remove(study_line)
                                low_cutoff_spoils+=1
            print 'Group:' + str(group) + ' Subject:' + str(subject) + ' Condition:' + str(condition)
            print 'NR Errors:'+ str(NR_errors_removed)
            print 'R Errors:'+ str(R_errors_removed)
            print 'Spoils:'+ str(spoils_removed)
            print 'low cutoff Spoils:'+ str(low_cutoff_spoils)
            goodstudyfile.write('{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\n'.format(NR_errors_removed, 'NR errors removed', R_errors_removed, 'R errors removed',spoils_removed, 'spoils removed',low_cutoff_spoils, 'low cutoff spoils'))
            goodstudyfile.write('{}\n'.format(first_block))
            for line in study_lines:
                goodstudyfile.write(line)
            goodstudyfile.close()

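So this iterates nicely over all my files (48 files, one for each possible combination of group, subject, and CondVar), but for some reason it frequently misses rows that should be removed. So in the supposedly "fixed" files, I still have a bunch of rows that should have been removed.

Nothing I do seems to fix or even change the result; the missed rows are always consistent (i.e., it always misses line 7 of Group2_Subject1_Condition_6, even though line 7 is marked as '2'). Can anyone tell me where I'm going wrong?

And here is an example of one of the rows being missed: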
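
One thing that may explain the consistent misses: the script calls `study_lines.remove(...)` while a `for` loop is still walking `study_lines`. In Python, removing an item from a list during iteration shifts the later items left, so the item right after the removed one is never visited. The sketch below reproduces that effect on made-up toy values (not data from the study files); the function name is hypothetical and the code is written for Python 3.

```python
def remove_while_iterating(codes):
    """Mimic the loop in the question: delete from the list being iterated."""
    codes = list(codes)  # work on a copy so the caller's list is untouched
    for code in codes:
        if code in ('2', '3'):
            # remove() shifts every later item one slot left, so the loop's
            # internal index now points past the item that followed `code`.
            codes.remove(code)
    return codes

# Toy Accuracy-T codes; all '2' and '3' values are meant to be removed.
leftover = remove_while_iterating(['1', '2', '2', '1', '3'])
print(leftover)  # → ['1', '2', '1']  (the second '2' was skipped)
```

Because the skipped item depends only on its position in the list, the same rows would be missed on every run, which would match the consistency described above.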

Subject Phase   Condition   Trial   Trial Type  Target Loc  TargetID    DistID  Digit1  Digit2  Accuracy-T  RT-P    RT-T    
1   1   6   25  incong  top V A L U E   G U I D E   9   7   2   304.780960083   866.713047028

This should have been pruned by the Python script because it has a value of '2' in Accuracy-T.
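
A minimal sketch of one way to prune such rows without modifying a list while looping over it: build a new list of the rows to keep. It assumes Python 3, that Accuracy-T sits at zero-based index 10 (the index the script in the question uses), and that the spaced letter strings such as `V A L U E` are single tab-separated fields. `prune_rows` and the constant names are hypothetical, and the per-category counters from the original script are left out.

```python
ACCURACY_INDEX = 10          # zero-based position used by the script in the question
REJECTED_CODES = {'2', '3'}  # Accuracy-T values whose rows should be dropped

def prune_rows(lines):
    """Return a new list without rows whose Accuracy-T field is '2' or '3'."""
    kept = []
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        has_code = len(fields) > ACCURACY_INDEX
        if has_code and fields[ACCURACY_INDEX].strip() in REJECTED_CODES:
            continue  # skip the row; the input list is never mutated
        kept.append(line)
    return kept

# Rows rebuilt from the samples above (field boundaries are an assumption).
good_row = '\t'.join(['2', '1', '9', '1', 'cong', 'bottom', 'S H I F T',
                      'S H I F T', '7', '2', '1', '742.69104',
                      '681.4379692']) + '\n'
bad_row = '\t'.join(['1', '1', '6', '25', 'incong', 'top', 'V A L U E',
                     'G U I D E', '9', '7', '2', '304.780960083',
                     '866.713047028']) + '\n'

pruned = prune_rows([good_row, bad_row, bad_row])
print(len(pruned))  # → 1
```

Two rejected rows in a row are both dropped here, since nothing is removed from the list being iterated.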

0 answers:

No answers