How do I split a file?

Time: 2018-11-18 06:40:14

Tags: python file

I am trying to split a file by dividing its total number of lines evenly across multiple files. However, this does not split the size evenly. Is there a way in Python 3.x to split a file into roughly equal-sized chunks without writing any incomplete line to a file? For example, a 100 MB text file would be split into 33 MB, 33 MB and 34 MB.

Here is what I have so far:

chunk=3
my_file = 'file.txt'
NUM_OF_LINES=-(-(sum(1 for line in open(my_file)))//chunk)+1
print(NUM_OF_LINES)


sorting = True
hold_lines = []
with open(my_file,'r') as text_file:
    for row in text_file:
        hold_lines.append(row)
outer_count = 1
line_count = 0
while sorting:
    count = 0
    increment = (outer_count-1) * NUM_OF_LINES
    left = len(hold_lines) - increment
    file_name = "text.txt_" + str(outer_count * NUM_OF_LINES) + ".txt"
    hold_new_lines = []
    if left < NUM_OF_LINES:
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < NUM_OF_LINES:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
    outer_count += 1
    with open(file_name,'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)

2 answers:

Answer 0 (score: 1)

If keeping the lines in their original order is not important, https://stackoverflow.com/a/30583482/783836 is a very simple solution.
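The linked answer is not reproduced here; as a hedged sketch of a fixed-lines-per-file split in the same spirit (the helper name `split_every` and the `out*.txt` naming are my own, not taken from that answer):

```python
from itertools import islice

def split_every(n, iterable):
    # yield successive lists of up to n items from iterable
    it = iter(iterable)
    chunk = list(islice(it, n))
    while chunk:
        yield chunk
        chunk = list(islice(it, n))

# usage sketch: one output file per group of 1000 lines
# with open('file.txt') as f:
#     for i, group in enumerate(split_every(1000, f)):
#         with open('out%d.txt' % i, 'w') as out:
#             out.writelines(group)

print(list(split_every(2, 'abcde')))  # [['a', 'b'], ['c', 'd'], ['e']]
```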

Answer 1 (score: 0)

This code tries to balance the sizes of the sub-files as closely as possible (not the number of lines; the two conditions cannot both be satisfied at once). I use a few numpy tools for concision and reliability. np.searchsorted finds the line numbers at which to split the original file.

import math
import numpy as np
lines=[]
lengths=[]
n=6

with open('file.txt') as f:
    for line in f:
        lines.append(line)
        lengths.append(len(line))

cumlengths = np.cumsum(lengths)
totalsize = cumlengths[-1]
chunksize = math.ceil(totalsize/n) # round up to the next integer so n chunks cover the file
# side='right' keeps a line that ends exactly on a chunk boundary in the earlier
# chunk, and makes the last split point len(lines) when totalsize divides evenly
# by n (with the default side='left' the final line would be dropped in that case)
starts = np.searchsorted(cumlengths,range(0,(n+1)*chunksize,chunksize),side='right')  # places to split

for k in range(n):
    with open('out' + str(k+1) + '.txt','w') as f:
        s=slice(starts[k],starts[k+1])
        f.writelines(lines[s])
        print(np.sum(lengths[s])) # check the size
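To make the searchsorted call concrete, here is a toy run on a hypothetical 4-line file with line lengths 3, 4, 2 and 5 (the numbers are made up for illustration):

```python
import numpy as np

lengths = [3, 4, 2, 5]            # hypothetical line sizes
cum = np.cumsum(lengths)          # array([ 3,  7,  9, 14])
# for each target offset, searchsorted returns the first line index whose
# cumulative size reaches it (side='left') or exceeds it (side='right')
print(np.searchsorted(cum, [0, 7, 14]))                # [0 1 3]
print(np.searchsorted(cum, [0, 7, 14], side='right'))  # [0 2 4]
```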

Without any external module, starts can also be built as follows:

chunksize = (sum(lengths)-1)//n+1   # ceiling division, as above
starts=[]
split=0
cumlength=0
for k,length in enumerate(lengths):
    cumlength += length
    if cumlength > split:   # strict >: a line ending exactly on a boundary stays in the earlier chunk
        starts.append(k)
        split += chunksize
starts.append(len(lengths))
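Wrapped into a self-contained function for easy checking (the name `split_points` is mine, not the answer's), the same idea gives, for the toy line lengths used above:

```python
def split_points(lengths, n):
    """Return n+1 indices marking where each of n size-balanced chunks starts."""
    chunksize = (sum(lengths) - 1) // n + 1   # ceiling division
    starts, split, cumlength = [], 0, 0
    for k, length in enumerate(lengths):
        cumlength += length
        if cumlength > split:   # line ending exactly on a boundary stays in the earlier chunk
            starts.append(k)
            split += chunksize
    starts.append(len(lengths))
    return starts

print(split_points([3, 4, 2, 5], 2))   # [0, 2, 4]: two chunks of 7 bytes each
```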