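I have a bunch of text files containing tab-delimited tables. The second column contains an id number, and each file is already sorted by that id number. I want to split each file into multiple files by the id number in column 2. Here is what I have.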
import os

readpath = 'path-to-read-file'
writepath = 'path-to-write-file'
for filename in os.listdir(readpath):
    with open(readpath+filename, 'r') as fh:
        lines = fh.readlines()
    lastid = 0
    f = open(writepath+'checkme.txt', 'w')
    f.write(filename)
    for line in lines:
        thisid = line.split("\t")[1]
        if int(thisid) <> lastid:
            f.close()
            f = open(writepath+thisid+'-'+filename,'w')
            lastid = int(thisid)
        f.write(line)
    f.close()
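All I get is a copy of every input file, with each new file name prefixed by the first id number in that file. It looks as if
thisid = line.split("\t")[1]
is only executed once in the loop. Any clues as to what is going on?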
Edit
The problem was that my files terminate lines with \r rather than \r\n. Corrected code (just add 'rU' when opening the file for reading and swap in != for <>):
import os

readpath = 'path-to-read-file'
writepath = 'path-to-write-file'
for filename in os.listdir(readpath):
    # 'rU' turns on universal newlines, so bare \r line endings are recognised
    with open(readpath+filename, 'rU') as fh:
        lines = fh.readlines()
    lastid = 0
    f = open(writepath+'checkme.txt', 'w')
    f.write(filename)
    for line in lines:
        thisid = line.split("\t")[1]
        if int(thisid) != lastid:
            # id changed: close the previous output file and start a new one
            f.close()
            f = open(writepath+thisid+'-'+filename,'w')
            lastid = int(thisid)
        f.write(line)
    f.close()
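For anyone who wants to see the failure in isolation, here is a minimal sketch. It assumes Python 3, where the 'U' mode flag is gone (it was removed in 3.11) and the `newline` argument of `open` controls line-ending handling instead. The file name and the sample rows are made up for the demonstration. With `newline='\n'` only `\n` counts as a line ending, so a file terminated with bare `\r` comes back as one long line, which is why only the first id was ever seen.

```python
import os
import tempfile

# Hypothetical sample data: three rows, each terminated by a bare '\r'.
sample = "a\t1\tx\rb\t1\ty\rc\t2\tz\r"

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "sample.txt")
# newline='' on write means the line endings are stored exactly as given.
with open(path, "w", newline="") as fh:
    fh.write(sample)

# Only '\n' is treated as a line ending here: the whole file is one "line".
with open(path, "r", newline="\n") as fh:
    strict_lines = fh.readlines()

# Default (universal newlines): '\r' is recognised and translated to '\n'.
with open(path, "r") as fh:
    universal_lines = fh.readlines()

print(len(strict_lines))     # 1
print(len(universal_lines))  # 3
```

In the single-line case `line.split("\t")[1]` still returns the first id, so the whole file is written to one output named after that id, matching the symptom described above.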
Answer 0 (score: 3)
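If you're dealing with tab-delimited files, then you can use the csv module, and take advantage of the fact that itertools.groupby will do the previous/current tracking of the id for you. Also use os.path.join to make sure your file names end up joined correctly.
Untested: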
import os
import csv
from itertools import groupby

readpath = 'path-to-read-file'
writepath = 'path-to-write-file'
for filename in os.listdir(readpath):
    with open(os.path.join(readpath, filename)) as fin:
        tabin = csv.reader(fin, delimiter='\t')
        # groupby yields one group per run of consecutive rows with the same id (column 2)
        for file_id, rows in groupby(tabin, lambda L: L[1]):
            with open(os.path.join(writepath, file_id + '-' + filename), 'w') as fout:
                tabout = csv.writer(fout, delimiter='\t')
                tabout.writerows(rows)
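Since the snippet above is untested, here is a small self-contained sketch of the same idea that can be run as-is. It assumes Python 3; `split_by_id`, its parameters, and the sample data are hypothetical names introduced for illustration, not part of the original code. Opening the files with `newline=''` leaves line-ending handling to the csv module, which also copes with bare `\r` terminators.

```python
import csv
import os
import tempfile
from itertools import groupby
from operator import itemgetter


def split_by_id(src_path, out_dir, id_column=1):
    """Write one file per run of consecutive rows sharing the same id.

    Hypothetical helper; returns the list of output paths it created.
    """
    created = []
    base = os.path.basename(src_path)
    # newline='' lets the csv module deal with '\r', '\n' and '\r\n' itself.
    with open(src_path, newline="") as fin:
        reader = csv.reader(fin, delimiter="\t")
        for file_id, rows in groupby(reader, key=itemgetter(id_column)):
            out_path = os.path.join(out_dir, file_id + "-" + base)
            with open(out_path, "w", newline="") as fout:
                writer = csv.writer(fout, delimiter="\t", lineterminator="\n")
                writer.writerows(rows)
            created.append(out_path)
    return created


# Demo on made-up data in a temporary directory.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "table.txt")
with open(src, "w", newline="") as fh:
    fh.write("a\t1\tx\rb\t1\ty\rc\t2\tz\r")
outdir = os.path.join(workdir, "out")
os.mkdir(outdir)

created = split_by_id(src, outdir)
print(sorted(os.path.basename(p) for p in created))
```

Note that groupby only merges consecutive rows, so this relies on the input already being sorted by id, as stated in the question. If an id reappeared later in the file, the second run would overwrite the first output file.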