I am trying to split the data in a single file into two separate files and count the lines in each of them. The input file is:
ID,MARK1,MARK2
sire1,AA,BB
dam2,AB,AA
sire3,AB,-
dam1,AA,BB
IND4,BB,AB
IND5,BB,AA
One output file should be:
ID,MARK1,MARK2
sire1,AA,BB
dam2,AB,AA
sire3,AB,-
dam1,AA,BB
The other should be:
ID,MARK1,MARK2
IND4,BB,AB
IND5,BB,AA
Here is my code:
import re

def file_len(filename):
    with open(filename, mode = 'r', buffering = 1) as f:
        for i, line in enumerate(f):
            pass
    return i

inputfile = open("test.txt", 'r')
outputfile_f1 = open("f1.txt", 'w')
outputfile_f2 = open("f2.txt", 'w')
matchlines = inputfile.readlines()
outputfile_f1.write(matchlines[0]) #add the header to the "f1.txt"
for line in matchlines:
    if re.match("sire*", line):
        outputfile_f1.write(line)
    elif re.match("dam*", line):
        outputfile_f1.write(line)
    else:
        outputfile_f2.write(line)
print 'the number of individuals in f1 is:', file_len(outputfile_f1)
print 'the number of individuals in f2 is:', file_len(outputfile_f2)
inputfile.close()
outputfile_f1.close()
outputfile_f2.close()
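The code does split the file into the two subsets, but I particularly dislike the way the header is added to the new file, and I wonder whether there is a better way to do it. Also, the function looks fine for counting lines, but when I run it, it gives me an error: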
"Traceback (most recent call last):
  File "./subset_individuals_based_on_ID.py", line 28, in <module>
    print 'the number of individuals in f1 is:', file_len(outputfile_f1)
  File "./subset_individuals_based_on_ID.py", line 7, in file_len
    with open(filename, mode = 'r', buffering = 1) as f:
TypeError: coercing to Unicode: need string or buffer, file found
"
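So I searched this site and added buffering = 1 (it was not in the original code), but that still did not solve the problem.
Thanks a lot for any help improving the code and clearing up the error.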
Answer 0 (score: 1)
I may be misreading you, but I believe you just want to do this:
>>> with open('test', 'r') as infile:
...     with open('test_out1', 'w') as out1, open('test_out2', 'w') as out2:
...         header, *lines = infile.readlines()
...         out1.write(header)
...         out2.write(header)
...         for line in lines:
...             if line.startswith('sir') or line.startswith('dam'):
...                 out1.write(line)
...             else:
...                 out2.write(line)
Contents of test before:
ID,MARK1,MARK2
sire1,AA,BB
dam2,AB,AA
sire3,AB,-
dam1,AA,BB
IND4,BB,AB
IND5,BB,AA
Contents of test_out1 afterwards:
ID,MARK1,MARK2
sire1,AA,BB
dam2,AB,AA
sire3,AB,-
dam1,AA,BB
Contents of test_out2 afterwards:
ID,MARK1,MARK2
IND4,BB,AB
IND5,BB,AA
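The question also asks for the row counts and about the TypeError. The traceback shows that file_len is handed outputfile_f1, an already open file object, while open() expects a path string, hence "need string or buffer, file found". One way to sidestep the problem is to count the rows while writing them. The following is only a sketch in Python 3: the function name split_by_prefix and its parameters are made up for illustration, and it assumes the first line of the input is the header.

```python
def split_by_prefix(in_path, out1_path, out2_path, prefixes=("sire", "dam")):
    """Copy the header to both outputs, route data rows by ID prefix,
    and return the number of data rows written to each output."""
    count1 = count2 = 0
    with open(in_path) as infile, \
         open(out1_path, "w") as out1, \
         open(out2_path, "w") as out2:
        # Assumption: the first line is the header; "" if the file is empty.
        header = next(infile, "")
        out1.write(header)
        out2.write(header)
        for line in infile:
            if not line.strip():
                continue  # ignore blank lines so they are not counted
            # str.startswith accepts a tuple of alternative prefixes.
            if line.startswith(prefixes):
                out1.write(line)
                count1 += 1
            else:
                out2.write(line)
                count2 += 1
    return count1, count2
```

Calling split_by_prefix('test.txt', 'f1.txt', 'f2.txt') on the sample data should return the two counts without reopening the output files. Plain prefixes are used instead of re.match("sire*", line) because in that pattern the * applies only to the final e, so it matches anything starting with "sir".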
Answer 1 (score: 1)
You can also use itertools.tee to split the input into multiple streams and process each one separately.
import itertools

def write_file(match, source, out_file):
    count = -1
    with open(out_file, 'w') as output:
        for line in source:
            if count < 0 or match(line):
                output.write(line)
                count += 1
    print('Wrote {0} lines to {1}'.format(count, out_file))

with open('test.txt', 'r') as f:
    first, second = itertools.tee(f.readlines())
    write_file(lambda x: not x.startswith('IND'), first, 'f1.txt')
    write_file(lambda x: x.startswith('IND'), second, 'f2.txt')
Edit: removed a redundant elif
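To make the role of itertools.tee a little more concrete, here is a small sketch on an in-memory list (the sample lines are abbreviated from the question, and the variable names parents and offspring are only illustrative). tee returns independent iterators over the same items, so each one can be filtered on its own:

```python
import itertools

# A few sample rows standing in for the contents of the input file.
lines = ["ID,MARK1,MARK2\n", "sire1,AA,BB\n", "dam2,AB,AA\n", "IND4,BB,AB\n"]

# tee() yields two independent iterators over the same sequence.
first, second = itertools.tee(lines)

# Skip the header on each stream, then filter by ID prefix.
next(first)
next(second)
parents = [row for row in first if not row.startswith("IND")]
offspring = [row for row in second if row.startswith("IND")]
```

Since readlines() already returns a list, iterating over that list twice would probably work just as well here; tee is mainly useful when the source is a one-shot iterator.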