Question

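I have a large tab-delimited file with about 1.4 million rows and 50 columns. Before doing anything with the data, I want to split this large file into a few thousand smaller files. The first column of my file contains position information, and I want each smaller file to cover a specific interval of those positions. In separate lists I have the start and stop of each interval I want to split the larger file into. Below is the part of my code that does this; the start and stop positions are held in lists named start_L and stop_L: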

import csv

for i in range(len(id)):
    out1=("file%s.txt")%(id[i])
    table=open('largefile.tsv',"r")
    start=int(start_L[i])
    stop=int(stop_L[i])
    next(table)  # skip the header line
    temp_out=open(out1,"w")
    reader=csv.reader(table,delimiter="\t")
    for line in reader:
        if int(line[0]) in range(start,stop):
            for y in line:
                temp_out.write(("%s\t")%(y))
            temp_out.write("\n")
        else:
            if int(line[0]) > stop:
                break
            else:
                pass
    print("temporary file...", id[i])

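The code above does what I want, but it is extremely slow. It gets through the first hundred or so intervals in a few minutes, but slows down exponentially with each subsequent interval, so a full run would take days. Is there a faster, more efficient way to do this? I believe the problem is that it has to scan through the whole file on every pass of the loop to find the positions within the specified interval.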

Answer 1

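Your program slows down over time because you re-read the CSV file from the top for every output file. As the range you are looking at moves further down the CSV file, you have to read more and more data (most of which you skip) for each output file. That is why each interval takes longer than the one before: the total work grows roughly with the square of the number of intervals.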

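You need to reorganize your code so that you read the CSV only once, sequentially, and pick out the ranges of interest (and write them to files) inside that loop. This is only possible if the CSV is sorted by range (which you said it is) and your start_L / stop_L lists are sorted accordingly.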
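As a rough illustration of that reorganization, here is a minimal sketch of a single-pass split. It assumes the rows and the intervals are both sorted by position, that the intervals do not overlap, and that each interval is half-open, [start, stop), as in the question's range(start, stop) test. The function name split_sorted_file and the parameters out_dir and has_header are made up for this example.

```python
import csv
import os


def split_sorted_file(in_path, ids, starts, stops, out_dir, has_header=True):
    """Split a position-sorted TSV into one file per [start, stop) interval.

    Assumes rows and intervals are sorted by position and that the
    intervals do not overlap. Returns the list of files written.
    """
    written = []
    i = 0       # index of the interval currently being filled
    out = None  # output file for interval i, opened on its first row
    with open(in_path, newline="") as table:
        reader = csv.reader(table, delimiter="\t")
        if has_header:
            next(reader, None)  # skip the header row, as the question does
        for row in reader:
            pos = int(row[0])
            # Move past every interval that ends at or before this position.
            while i < len(ids) and pos >= stops[i]:
                if out is not None:
                    out.close()
                    out = None
                i += 1
            if i == len(ids):
                break  # nothing further in the file can match
            if pos < starts[i]:
                continue  # row falls in a gap before the next interval
            if out is None:
                path = os.path.join(out_dir, "file%s.txt" % ids[i])
                out = open(path, "w")
                written.append(path)
            out.write("\t".join(row) + "\n")
    if out is not None:
        out.close()
    return written
```

With the question's lists this would be called along the lines of split_sorted_file('largefile.tsv', id, [int(s) for s in start_L], [int(s) for s in stop_L], '.'). Only one output file is open at a time, and the large file is read exactly once.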

Answer 2

The solution provided above helped me for the most part, but because my input has no line-number column, I had to make the following changes.

    table=fileinput.input('largefile.csv',mode="r")
    #
    #
    #
         if fileinput.lineno() >= stop :

My file is | delimited, with about 600k lines and about 120 MB in size; the whole file is split in just a few seconds.
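The fragment above only shows the two changed lines, so here is one way the full loop might look when the split is driven by line numbers instead of a position column. This is a sketch under assumptions: the function name split_by_line_number and the out_dir parameter are invented for the example, ids is non-empty, and stops holds ascending 1-based line numbers.

```python
import fileinput
import os


def split_by_line_number(in_path, ids, stops, out_dir):
    """Write lines to file<id>.txt, switching to the next file when the
    line number reaches the current stop (1-based, ascending)."""
    i = 0
    out = open(os.path.join(out_dir, "file%s.txt" % ids[i]), "w")
    table = fileinput.input(in_path, mode="r")
    try:
        for line in table:
            # fileinput.lineno() is the number of the line just read.
            if fileinput.lineno() >= stops[i]:
                out.close()
                i += 1
                if i >= len(ids):
                    break  # past the last requested chunk
                out = open(os.path.join(out_dir, "file%s.txt" % ids[i]), "w")
            out.write(line)
    finally:
        out.close()        # harmless if already closed
        fileinput.close()  # release the module-level fileinput state
```

Because lines are copied through unparsed, the delimiter (| here) does not matter to the split itself.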

Answer 3

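OK, I tried to keep this in the spirit of your code. It iterates through the large file only once, and it doesn't bother parsing the lines with the csv module, since you were just rejoining them when writing.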

id=("a","b")
start_L=(1,15)
stop_L=(16,40)

i=0
table=open('largefile.tsv',"r")
out1=(("file%s.txt")%(id[i]))
temp_out=open(out1,"w")

# Start iterating through the file
for line in table:
    stop=int(stop_L[i])

    # Split the line into the position and a throwaway
    # remainder at the first tab character
    position,the_rest=line.split("\t",1)

    # start is ignored because you mentioned the file is sorted
    if int(position) >= stop:
        # Close the current file
        temp_out.close()

        # Increment the index so the next file name is pulled from id.
        # If the index is past the end of the id list then break,
        # otherwise open the new file for writing
        i += 1
        if i < len(id):
            out1=(("file%s.txt")%(id[i]))
            temp_out=open(out1,"w")
        else:
            break

    temp_out.write(line)

# Closing again is harmless if the loop already closed temp_out
temp_out.close()
table.close()

The lines in my test file look like

1       1a      b       c       d       e
2       2a      b       c       d       e
3       3a      b       c       d       e

Depending on your specific data this could be simplified quite a bit, but I hope it at least gets you started.
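One possible simplification, offered only as a sketch: let itertools.groupby collect consecutive lines that fall before the same stop, and use bisect to find which stop that is. Like the loop in this answer it ignores the start positions, and it assumes there is no header line, that the file is sorted by position, and that stops is a sorted sequence of integers. The name split_with_groupby and the out_dir parameter are made up for the example.

```python
import os
from bisect import bisect_right
from itertools import groupby


def split_with_groupby(in_path, ids, stops, out_dir):
    """Single pass over a position-sorted file; only the stop positions
    decide which output file a line belongs to."""
    def interval_index(line):
        position = int(line.split("\t", 1)[0])
        # Number of stops <= position, i.e. the first interval still open.
        return bisect_right(stops, position)

    with open(in_path) as table:
        for i, lines in groupby(table, key=interval_index):
            if i >= len(ids):
                break  # past the last interval
            path = os.path.join(out_dir, "file%s.txt" % ids[i])
            with open(path, "w") as out:
                out.writelines(lines)
```

If the input were not sorted, the same interval index could come up more than once and a later group would overwrite the earlier file, so the sorted-input assumption matters here.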

Splitting a large tab-delimited file in Python

3 answers: