I am trying to split a text file by dividing its total number of lines evenly across several files. However, that does not split the size evenly. Is there a way to split a file into several equal-size chunks with Python 3.x, without writing any incomplete line to a file? For example, a 100 MB text file would be split into chunks of 33 MB, 33 MB, and 34 MB.
Here is what I have so far:
chunk = 3
my_file = 'file.txt'
NUM_OF_LINES = -(-(sum(1 for line in open(my_file))) // chunk) + 1
print(NUM_OF_LINES)
sorting = True
hold_lines = []
with open(my_file, 'r') as text_file:
    for row in text_file:
        hold_lines.append(row)
outer_count = 1
line_count = 0
while sorting:
    count = 0
    increment = (outer_count - 1) * NUM_OF_LINES
    left = len(hold_lines) - increment
    file_name = "text.txt_" + str(outer_count * NUM_OF_LINES) + ".txt"
    hold_new_lines = []
    if left < NUM_OF_LINES:
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < NUM_OF_LINES:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        outer_count += 1
    with open(file_name, 'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)
Answer 0 (score: 1)
If keeping the lines in their original order does not matter, https://stackoverflow.com/a/30583482/783836 is a pretty simple solution.
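For context, a minimal round-robin sketch in that spirit (my own illustration, not necessarily the code behind the link; n = 3 and the file.txt / out0.txt… names are placeholders):

from itertools import cycle

n = 3
with open('file.txt') as src:
    outs = [open('out%d.txt' % i, 'w') for i in range(n)]
    try:
        # Deal lines to the output files in turn; the sizes come out
        # roughly equal whenever line lengths are reasonably uniform.
        for out, line in zip(cycle(outs), src):
            out.write(line)
    finally:
        for out in outs:
            out.close()

Each output file receives every n-th line, so the original ordering is interleaved away, which is why this only applies when order does not matter.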
Answer 1 (score: 0)
This code tries to equalize the sizes of the sub-files as faithfully as possible (rather than their line counts; the two goals cannot both be met exactly at once). I use a few numpy tools for conciseness and reliability. np.searchsorted finds the line numbers at which to split the original file.
import math
import numpy as np

lines = []
lengths = []
n = 6
with open('file.txt') as f:
    for line in f:
        lines.append(line)
        lengths.append(len(line))
cumlengths = np.cumsum(lengths)
totalsize = cumlengths[-1]
chunksize = math.ceil(totalsize / n)  # round up to the next integer
# side='right' keeps a line whose end falls exactly on a boundary (in
# particular the last line of the file) in the earlier chunk
starts = np.searchsorted(cumlengths, range(0, (n + 1) * chunksize, chunksize), side='right')  # places to split
for k in range(n):
    with open('out' + str(k + 1) + '.txt', 'w') as f:
        s = slice(starts[k], starts[k + 1])
        f.writelines(lines[s])
        print(np.sum(lengths[s]))  # check the size of each chunk
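To make the role of np.searchsorted concrete, here is a tiny illustration with made-up numbers (my addition, not part of the answer):

import numpy as np

# Cumulative line lengths: line 0 ends at offset 4, line 1 at offset 9, ...
cum = np.array([4, 9, 15, 22])
# For each target offset, the number of whole lines that fit before it:
print(np.searchsorted(cum, [0, 11, 22], side='right'))  # -> [0 2 4]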
Without external modules, starts can also be built this way:
chunksize = (sum(lengths) - 1) // n + 1
starts = []
split = 0
cumlength = 0
for k, length in enumerate(lengths):
    cumlength += length
    if cumlength >= split:
        starts.append(k)
        split += chunksize
starts.append(k + 1)
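As a quick sanity check (my addition, assuming lengths, starts, and n from the snippets above), the chunk sizes can be recomputed from starts and compared against the total:

sizes = [sum(lengths[starts[k]:starts[k + 1]]) for k in range(n)]
print(sizes)                       # size of each chunk in characters
assert sum(sizes) == sum(lengths)  # every line lands in exactly one chunk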