Splitting a large text file into smaller text files by line number using Python

Time: 2013-04-29 23:25:07

Tags: python file split lines

I have a text file, say really_big_file.txt, that contains:
line 1
line 2
line 3
line 4
...
line 99999
line 100000

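I want to write a Python script that splits really_big_file.txt into smaller files of 300 lines each. For example, small_file_300.txt would contain lines 1-300, small_file_600.txt would contain lines 301-600, and so on, until enough small files have been created to hold every line of the big file.

I would appreciate any suggestions on the easiest way to accomplish this using Python.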

7 answers:

Answer 0 (score: 23)

lines_per_file = 300
smallfile = None
with open('really_big_file.txt') as bigfile:
    for lineno, line in enumerate(bigfile):
        if lineno % lines_per_file == 0:
            if smallfile:
                smallfile.close()
            small_filename = 'small_file_{}.txt'.format(lineno + lines_per_file)
            smallfile = open(small_filename, "w")
        smallfile.write(line)
    if smallfile:
        smallfile.close()

Answer 1 (score: 20)

Use the itertools grouper recipe:

try:
    from itertools import zip_longest  # Python 3
except ImportError:
    from itertools import izip_longest as zip_longest  # Python 2

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)

n = 300

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=''), 1):
        with open('small_file_{0}'.format(i * n), 'w') as fout:
            fout.writelines(g)

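The advantage of this method, compared with storing every line in a list, is that it treats the file as an iterable and consumes it line by line, so the whole big file is never loaded into memory at once (only one group of n lines is held at a time).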

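Note that in this case the last file will be named small_file_100200 but will only go up to line 100000. This happens because fillvalue='', which means that when there are no lines left, nothing is written for the missing entries of the last, incomplete group. You can fix this by writing to a temporary file and renaming it afterwards, instead of naming it up front as I did. Here is how that can be done: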

import os, tempfile

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=None)):
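        # dir='.' keeps the temp file on the same filesystem, so os.rename cannot fail across devices
        with tempfile.NamedTemporaryFile('w', dir='.', delete=False) as fout: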
            for j, line in enumerate(g, 1): # count number of lines in group
                if line is None:
                    j -= 1 # don't count this line
                    break
                fout.write(line)
        os.rename(fout.name, 'small_file_{0}.txt'.format(i * n + j))

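This time fillvalue=None, and I check each line for None. When it occurs, I know the group has run out of lines, so I subtract 1 from j so that the filler is not counted, and then the temporary file is renamed using the real last line number.

Answer 2 (score: 3)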
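
As an alternative to padding with a fill value, here is a minimal sketch that reads each chunk with itertools.islice, so no filler ever has to be detected. The function name split_file and its parameters lines_per_file and prefix are hypothetical names chosen for illustration; they do not appear in the answer above.

```python
import os
from itertools import islice


def split_file(path, lines_per_file=300, prefix='small_file_'):
    """Split `path` into files named after the last line number they hold.

    Returns the list of files written, in order.
    """
    out_dir = os.path.dirname(os.path.abspath(path))
    written = []
    last_line = 0
    with open(path) as f:
        while True:
            # islice pulls at most `lines_per_file` lines and adds no padding
            chunk = list(islice(f, lines_per_file))
            if not chunk:
                break  # the big file is exhausted
            last_line += len(chunk)
            name = os.path.join(out_dir, '{0}{1}.txt'.format(prefix, last_line))
            with open(name, 'w') as fout:
                fout.writelines(chunk)
            written.append(name)
    return written
```

Calling split_file('really_big_file.txt') on the 100000-line file from the question would end with a file whose name reflects line 100000 rather than 100200, because the name is computed from the number of lines actually read. Only one chunk is held in memory at a time.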

import csv
import os
import re

MAX_CHUNKS = 300


def writeRow(idr, row):
    # Python 3: csv files are opened in text mode with newline='' (Python 2 used 'ab')
    with open("file_%d.csv" % idr, 'a', newline='') as file:
        writer = csv.writer(file, delimiter=',', quotechar='\"', quoting=csv.QUOTE_ALL)
        writer.writerow(row)

def cleanup():
    # remove chunk files left over from a previous run (only names like file_12.csv)
    for f in os.listdir("."):
        if re.match(r"file_\d+\.csv$", f):
            os.remove(os.path.join(".", f))

def main():
    cleanup()
    with open("large_file.csv", 'r', newline='') as results:
        r = csv.reader(results, delimiter=',', quotechar='\"')
        for i, x in enumerate(r):
            # rows 0-299 go to file 1, rows 300-599 to file 2, and so on
            idr = i // MAX_CHUNKS + 1
            writeRow(idr, x)

if __name__ == "__main__": main()

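Answer 3 (score: 2)

I did this in a more understandable way, using fewer shortcuts, to give you a better idea of how and why it works. The previous answers work, but if you are not familiar with certain built-in functions you will not understand what they are doing.

Because you posted no code, I decided to do it this way: you may be unfamiliar with anything beyond basic Python syntax, since the way you phrased the question suggests that you either did not try or had no idea how to approach the problem.

Here are the steps to do this in basic Python:

First, read your file into a list for safekeeping: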

my_file = 'really_big_file.txt'
hold_lines = []
with open(my_file,'r') as text_file:
    for row in text_file:
        hold_lines.append(row)

Second, you need a way to create the new files by name. I would suggest a loop along with a couple of counters:

outer_count = 1
line_count = 0
sorting = True
while sorting:
    count = 0
    increment = (outer_count-1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"

Third, inside that loop, you need some nested loops that save the correct rows into a list:

    hold_new_lines = []
    if left <= 300:  # <= so that an exact multiple of 300 does not leave an empty extra file
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1

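Last, still inside your first loop, you need to write the new file and add your final counter increment, so that the loop goes around again and writes the next file: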

    outer_count += 1
    with open(file_name,'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)

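Note: if the number of lines is not divisible by 300, the name of the last file will not correspond to the number of its last line.

It is important to understand how these loops work. They are set up so that on each pass through the loop the name of the file you write changes, because the name depends on a variable that changes. This is a very useful scripting technique for accessing, opening, writing, and organizing files.

If you could not follow what happens in the loop, here is the script in its entirety: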

my_file = 'really_big_file.txt'
sorting = True
hold_lines = []
with open(my_file,'r') as text_file:
    for row in text_file:
        hold_lines.append(row)
outer_count = 1
line_count = 0
while sorting:
    count = 0
    increment = (outer_count-1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"
    hold_new_lines = []
    if left <= 300:  # <= avoids an empty trailing file when the line count is a multiple of 300
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
    outer_count += 1
    with open(file_name,'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)
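
For comparison, the same steps can be condensed with list slicing. This is only a sketch under the same assumption as the answer (the whole file already sits in a list); the helper names chunk_lines and write_chunks are hypothetical. Here each file name is derived from the real last line number, so the naming mismatch mentioned in the note does not arise.

```python
def chunk_lines(lines, size=300):
    """Return (file_name, chunk) pairs; names use the real last line number."""
    result = []
    for start in range(0, len(lines), size):
        chunk = lines[start:start + size]  # slicing simply stops at the end of the list
        last_line = start + len(chunk)
        result.append(('small_file_{}.txt'.format(last_line), chunk))
    return result


def write_chunks(pairs):
    """Write every (file_name, chunk) pair produced by chunk_lines to disk."""
    for file_name, chunk in pairs:
        with open(file_name, 'w') as next_file:
            next_file.writelines(chunk)
```

With hold_lines filled as in the first step, write_chunks(chunk_lines(hold_lines)) would produce the small files. Because range(0, len(lines), size) yields nothing for an empty list, no empty file is created.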

Answer 4 (score: 0)

lines_per_file = 300  # Lines on each small file
lines = []  # Stores lines not yet written on a small file
lines_counter = 0  # Same as len(lines)
created_files = 0  # Counting how many small files have been created

with open('really_big_file.txt') as big_file:
    for line in big_file:  # Go through the whole big file
        lines.append(line)
        lines_counter += 1
        if lines_counter == lines_per_file:
            idx = lines_per_file * (created_files + 1)
            with open('small_file_%s.txt' % idx, 'w') as small_file:
                # Write all lines on small file (each line already ends with '\n')
                small_file.writelines(lines)
            lines = []  # Reset variables
            lines_counter = 0
            created_files += 1  # One more small file has been created
    # After for-loop has finished
    if lines_counter:  # There are still some lines not written on a file?
        idx = lines_per_file * (created_files + 1)
        with open('small_file_%s.txt' % idx, 'w') as small_file:
            # Write them on a last small file
            small_file.writelines(lines)
        created_files += 1

print('%s small files (with up to %s lines each) were created.' % (created_files,
                                                                   lines_per_file))

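Answer 5 (score: 0)

I had to do the same with a 650,000-line file.

Use the enumerate index and integer-divide it (//) by the chunk size.

When that number changes, close the current file and open a new one.

This is a Python 3 solution using format strings (f-strings).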

chunk = 50000  # number of lines from the big file to put in small file
this_small_file = open('./a_folder/0', 'a')

with open('massive_web_log_file') as file_to_read:
    for i, line in enumerate(file_to_read):  # iterate lazily instead of readlines()
        file_name = f'./a_folder/{i // chunk}'
        print(i, file_name)  # optional feedback; printing every line slows the process down

        if file_name != this_small_file.name:
            # the chunk number changed: switch files before writing this line
            this_small_file.close()
            this_small_file = open(file_name, 'a')
        this_small_file.write(line)

this_small_file.close()  # close the last small file
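
To make the integer-division idea concrete, here is a small sketch that shows which chunk number each line index maps to, and how lines can be grouped on that number. The helper names chunk_numbers and group_by_chunk are hypothetical and work on in-memory lists purely for illustration.

```python
from itertools import groupby


def chunk_numbers(total_lines, chunk):
    """Map each line index to its chunk number with integer division."""
    return [i // chunk for i in range(total_lines)]


def group_by_chunk(lines, chunk):
    """Group lines so that a new group starts whenever i // chunk changes."""
    keyed = groupby(enumerate(lines), key=lambda pair: pair[0] // chunk)
    # each inner group is consumed before groupby advances to the next key
    return [(number, [line for _, line in group]) for number, group in keyed]
```

With a chunk size of 3, indexes 0-2 map to chunk 0, indexes 3-5 to chunk 1, and so on, which is exactly the moment at which the loop above would switch to a new output file.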

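Answer 6 (score: 0)

files is set to the number of files you want to split the main file into. For example, I want to get 10 files from the main file.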

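files = 10
with open("data.txt", "r") as data:
    emails = data.readlines()
# lines per file, rounded up so that at most `files` files are created
batchs = max(1, -(-len(emails) // files))
for idx, log in enumerate(emails):
    fileid = idx // batchs
    # append to the matching file; the with-block closes it after each write
    with open("minifile{file}.txt".format(file=fileid + 1), 'a+') as file:
        file.write(log)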
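
When the line count is not an exact multiple of the number of files, one option is to spread the remainder so that batch sizes differ by at most one line. The sketch below assumes the lines are already in a list; the function name split_evenly is hypothetical.

```python
def split_evenly(lines, files):
    """Split `lines` into `files` consecutive batches whose sizes differ by at most one."""
    base, extra = divmod(len(lines), files)
    batches = []
    start = 0
    for k in range(files):
        size = base + (1 if k < extra else 0)  # the first `extra` batches get one more line
        batches.append(lines[start:start + size])
        start += size
    return batches
```

Each batch could then be written to its own file, for example minifile1.txt for the first batch. If there are fewer lines than files, the trailing batches are empty and can be skipped when writing.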