Question

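I have a bunch of text files containing tab-delimited tables. The second column contains an id number, and each file is already sorted by that id number. I want to split each file into multiple files by the id number in column 2. Here's what I have.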

readpath = 'path-to-read-file'
writepath = 'path-to-write-file'
for filename in os.listdir(readpath):
     with open(readpath+filename, 'r') as fh:
          lines = fh.readlines()
     lastid = 0
     f = open(writepath+'checkme.txt', 'w')
     f.write(filename)
     for line in lines:
          thisid = line.split("\t")[1]
          if int(thisid) <> lastid:
               f.close()
               f = open(writepath+thisid+'-'+filename,'w')
               lastid = int(thisid)
          f.write(line)
     f.close()

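All I get is a copy of each input file, with the first id number from that file prepended to the new file name. It looks as if

thisid = line.split("\t")[1]

is only executed once in the loop. Any clues as to what is going on?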

Edit

The problem was that my files terminate lines with \r rather than \r\n. Corrected code (just added 'rU' when opening the read file and replaced <> with !=):

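import os

readpath = 'path-to-read-file'
writepath = 'path-to-write-file'
for filename in os.listdir(readpath):
     # 'rU' is Python 2's universal-newline mode, so a bare '\r' also ends a line
     # (the 'U' flag was removed in Python 3.11; plain text mode does this by default there)
     with open(readpath+filename, 'rU') as fh: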
          lines = fh.readlines()
     lastid = 0
     f = open(writepath+'checkme.txt', 'w')
     f.write(filename)
     for line in lines:
          thisid = line.split("\t")[1]
          if int(thisid) != lastid:
               f.close()
               f = open(writepath+thisid+'-'+filename,'w')
               lastid = int(thisid)
          f.write(line)
     f.close()
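For anyone hitting the same symptom, here is a minimal sketch of why the loop appeared to run only once. It is written for Python 3, not the Python 2 used above, and the sample file contents are made up for illustration. Passing newline="\n" to open() makes only "\n" count as a line ending, which roughly mimics what plain 'r' mode did with these files; the default text mode treats a bare "\r" as a line ending too.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample file: two tab-delimited rows ending in bare "\r"
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as fh:
        fh.write(b"a\t1\tx\rb\t2\ty\r")

    # Only "\n" terminates a line here, so the whole file is one "line"
    with open(path, newline="\n") as fh:
        raw_lines = fh.readlines()

    # Default text mode: "\r", "\n" and "\r\n" all end a line
    with open(path) as fh:
        universal_lines = fh.readlines()

print(len(raw_lines))        # the file comes back as a single line
print(len(universal_lines))  # one entry per row
```

With a single "line", line.split("\t")[1] only ever sees the first id, so every row ends up in one output file named after that id.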

Answer 1

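If you're dealing with tab-delimited files, you can use the csv module and take advantage of the fact that itertools.groupby will do the previous/current id tracking for you. Also use os.path.join to make sure your file names end up joined correctly.

Untested: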

import os
import csv
from itertools import groupby

readpath = 'path-to-read-file'
writepath = 'path-to-write-file'

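for filename in os.listdir(readpath):
    with open(os.path.join(readpath, filename)) as fin:
        tabin = csv.reader(fin, delimiter='\t')
        # groupby yields one (id, rows) pair per run of consecutive rows
        # sharing the same value in column 2, so the input must be sorted by id
        for file_id, rows in groupby(tabin, lambda L: L[1]):
            # one output file per id, named '<id>-<original filename>'
            with open(os.path.join(writepath, file_id + '-' + filename), 'w') as fout:
                tabout = csv.writer(fout, delimiter='\t')
                tabout.writerows(rows)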
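To see what groupby contributes in the code above, here is a small in-memory sketch. The rows are made up for illustration, and operator.itemgetter(1) is just an alternative way of writing the lambda. The last row is deliberately out of order to show that groupby only groups consecutive rows.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows as csv.reader would yield them; column 2 holds the id
rows = [
    ["a", "1", "x"],
    ["b", "1", "y"],
    ["c", "2", "z"],
    ["d", "1", "w"],  # id 1 again, but not adjacent to the first run
]

# Collect (id, first-column values) for each consecutive run of equal ids
groups = [
    (file_id, [row[0] for row in run])
    for file_id, run in groupby(rows, key=itemgetter(1))
]
print(groups)
# [('1', ['a', 'b']), ('2', ['c']), ('1', ['d'])]
```

Because id 1 shows up as two separate groups here, the splitting code would open the same output file twice in 'w' mode and the second write would overwrite the first. That is why it matters that each input file is already sorted by id.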

How to split a text file by id in Python
