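I have downloaded this CSV file, which gives me a spreadsheet of gene information. The important part is that the HLA-* columns contain the gene information. If the gene resolution is too low, e.g. DQB1*03, the whole row should be deleted. If the resolution is too high, e.g. DQB1*03:02:01, the trailing :01 field needs to be removed. So ideally I want each value in the format DQB1*03:02, i.e. with exactly two levels of resolution after DQB1*. How can I tell Python to look for these formats regardless of the actual values stored in them?
For example: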
    if (csvCell is of format DQB1*03:02:01):
        delete the :01  # but do this in a general format
    elif (csvCell is of format DQB1*03):
        delete row
    else:
        goto next line
Update: here is the edited code I am referring to:
    import csv
    import re

    # Python 3: newline='' lets the csv module handle line endings itself,
    # and the output file is opened for writing rather than 'r+b'.
    with open('mhc.csv', newline='') as infile, \
            open('mhc_fixed.csv', 'w', newline='') as outfile:
        csvdictreader = csv.DictReader(infile, delimiter=',')
        csvdictwriter = csv.DictWriter(outfile, fieldnames=csvdictreader.fieldnames, delimiter=',')
        csvdictwriter.writeheader()
        targets = [name for name in csvdictreader.fieldnames if name.startswith('HLA-D')]
        for rowfields in csvdictreader:
            keep = True
            for field in targets:
                value = rowfields[field]
                if re.match(r'^\w+\*\d\d$', value):  # gene resolution too low
                    keep = False
                    break  # quit processing target fields
                elif re.match(r'^(\w+)\*(\d+):(\d+):(\d+):(\d+)$', value):
                    # four fields present: keep only the first two
                    rowfields[field] = re.sub(r'^(\w+)\*(\d+):(\d+):(\d+):(\d+)$', r'\1*\2:\3', value)
                else:  # reduce gene resolution if too high
                    # by keeping only the first two fields if three are present
                    rowfields[field] = re.sub(r'^(\w+)\*(\d+):(\d+):(\d+)$', r'\1*\2:\3', value)
            if keep:
                csvdictwriter.writerow(rowfields)
Answer 0 (score: 2)
Here is a super-simple filter:
    import sys

    for line in sys.stdin:
        line = line.replace(',DQB1*03:02:01,', ',DQB1*03:02,')
        if line.find(',DQB1*03,') == -1:
            sys.stdout.write(line)
Or, if you want to use regular expressions:
    import re
    import sys

    for line in sys.stdin:
        line = re.sub(r',DQB1\*03:02:01,', ',DQB1*03:02,', line)
        if re.search(r',DQB1\*03,', line) is None:
            sys.stdout.write(line)
Run it as:
    python script.py < data.csv
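The two filters above only recognise the literal values DQB1*03:02:01 and DQB1*03. As a rough sketch of how the same line-by-line idea might be generalised to any gene name and any digits, the patterns below match on the shape of a cell instead. The names `LOW_RES`, `HIGH_RES` and `filter_line` are hypothetical, and the sketch assumes plain comma-separated cells with no quoting and no suffix letters after the last field.

```python
import re

# Assumption: cells are separated by bare commas (no quoted fields).
# A cell such as DQB1*03 (one field only) marks a row to drop.
LOW_RES = re.compile(r'(?:^|,)\w+\*\d+(?=,|$)')
# A cell with three or more fields, e.g. DQB1*03:02:01, is cut to two.
HIGH_RES = re.compile(r'((?:^|,)\w+\*\d+:\d+)(?::\d+)+(?=,|$)')

def filter_line(line):
    """Return the cleaned line, or None if the row should be dropped."""
    body = line.rstrip('\r\n')
    ending = line[len(body):]  # preserve the original line ending
    if LOW_RES.search(body):
        return None
    return HIGH_RES.sub(r'\1', body) + ending
```

Inside the `for line in sys.stdin:` loop this would replace both steps: call `filter_line(line)` and write the result only when it is not `None`.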
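Answer 1 (score: 2)
Here is something that I think will do what you want. It isn't as simple as Peter's answer because it uses Python's csv module to process the file. It could probably be rewritten and simplified to just treat the file as plain text, but that should be easy.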
    import csv
    import re
    import sys

    csvdictreader = csv.DictReader(sys.stdin, delimiter=',')
    csvdictwriter = csv.DictWriter(sys.stdout, fieldnames=csvdictreader.fieldnames, delimiter=',')
    csvdictwriter.writeheader()
    # only the HLA-* columns hold gene information
    targets = [name for name in csvdictreader.fieldnames if name.startswith('HLA-')]
    for rowfields in csvdictreader:
        keep = True
        for field in targets:
            value = rowfields[field]
            if re.match(r'^DQB1\*\d\d$', value):  # gene resolution too low?
                keep = False
                break  # quit processing target fields
            else:  # reduce gene resolution if too high
                # by keeping only the first two fields if three are present
                rowfields[field] = re.sub(r'^DQB1\*(\d\d):(\d\d):(\d\d)$',
                                          r'DQB1*\1:\2', value)
        if keep:
            csvdictwriter.writerow(rowfields)
The hardest part for me was figuring out what you wanted to do.
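The patterns in this answer are tied to DQB1 and to two-digit fields. If the HLA-* columns hold other genes or longer values, one possible variation is to move the per-cell decision into a small helper, sketched below. The names `ALLELE`, `normalize_allele` and `clean_csv` are hypothetical. The sketch assumes Python 3, that a value is a gene name, an asterisk, and colon-separated numeric fields, and that anything else should pass through unchanged. It works on an in-memory string only to keep the example self-contained.

```python
import csv
import io
import re

# Assumed cell shape: GENE*field[:field...], all fields numeric.
ALLELE = re.compile(r'^(\w+)\*(\d+)((?::\d+)*)$')

def normalize_allele(value):
    """Cut an allele to two fields; return None if it has only one.

    Values that do not look like alleles are returned unchanged.
    """
    m = ALLELE.match(value)
    if m is None:
        return value
    extra = m.group(3).split(':')[1:]  # fields after the first one
    if not extra:
        return None  # resolution too low, caller should drop the row
    return '%s*%s:%s' % (m.group(1), m.group(2), extra[0])

def clean_csv(text, prefix='HLA-'):
    """Apply normalize_allele to every column whose name starts with prefix."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            lineterminator='\n')
    writer.writeheader()
    targets = [n for n in reader.fieldnames if n.startswith(prefix)]
    for row in reader:
        # a missing cell is treated as empty text and left as is
        cleaned = {f: normalize_allele(row[f] or '') for f in targets}
        if None in cleaned.values():
            continue  # at least one target cell was too low resolution
        row.update(cleaned)
        writer.writerow(row)
    return out.getvalue()
```

Keeping the decision in `normalize_allele` means the three cases (too low, too high, already fine) can be checked on single values before running the script on the real file.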