Question

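Python: what is the fastest way to split a file into two files, each containing half of the lines of the original, such that the lines in each of the two files are chosen at random?

For example, if the file is 1 2 3 4 5 6 7 8 9 10

it could be split into:

3 2 10 9 1

4 6 8 5 7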

Answer 1

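This kind of operation is usually called a "partition". There is no built-in partition function, but I found this article: Partition in Python.

Given that definition, you can do this: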

import random

def partition(l, pred):
    yes, no = [], []
    for e in l:
        if pred(e):
            yes.append(e)
        else:
            no.append(e)
    return yes, no

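# Read all lines; the with-block closes the file afterwards
with open("file.txt") as f:
    lines = f.readlines()
# Each line lands in lines1 with probability 0.5, otherwise in lines2
lines1, lines2 = partition(lines, lambda x: random.random() < 0.5)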

Note that this does not necessarily split the file exactly in half; the two parts are equal in size only on average.
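If an exact half is needed while keeping the partition style, one option is to pick the indices for the first part up front. The following is only a sketch: `partition_exact` is a hypothetical name, not something from the linked article, and it assumes the whole list fits in memory.

```python
import random

def partition_exact(items):
    """Randomly split items into two lists whose sizes differ by at most one.

    The relative order of the items is preserved inside each part.
    """
    n = len(items)
    # Choose exactly n // 2 distinct positions for the first part
    chosen = set(random.sample(range(n), n // 2))
    first = [item for i, item in enumerate(items) if i in chosen]
    second = [item for i, item in enumerate(items) if i not in chosen]
    return first, second
```

For a list of 10 lines this always yields two lists of 5; for an odd count the second list receives the extra line.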

Answer 2

You can simply load the file, call random.shuffle on the resulting list, and then split it into two files (untested code):

def shuffle_split(infilename, outfilename1, outfilename2):
    from random import shuffle

    with open(infilename, 'r') as f:
        lines = f.readlines()

    # append a newline in case the last line didn't end with one
    # (guarded so that an empty file doesn't raise IndexError)
    if lines:
        lines[-1] = lines[-1].rstrip('\n') + '\n'

    shuffle(lines)

    with open(outfilename1, 'w') as f:
        f.writelines(lines[:len(lines) // 2])
    with open(outfilename2, 'w') as f:
        f.writelines(lines[len(lines) // 2:])

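random.shuffle shuffles lines in place and does nearly all of the work. Python's sequence slicing (e.g. lines[len(lines) // 2:]) makes the rest very convenient.

I am assuming the file is not huge, i.e. that it fits comfortably in memory. If that is not the case, you will probably need to do something more elaborate, perhaps using the linecache module to read randomly chosen line numbers from the input file. I think you would want to generate two lists of line numbers, using a technique similar to the one shown above.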
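For the large-file case, here is one possible sketch. It swaps in a different technique from the linecache idea: as far as I know, linecache keeps all of a file's lines cached in memory once it has read them, so this version instead records the byte offset of each line and seeks to it. The function name `shuffle_split_large` is hypothetical, and the sketch assumes the list of offsets (one integer per line) fits in memory even if the file contents do not.

```python
import random

def shuffle_split_large(infilename, outfilename1, outfilename2):
    # Pass 1: record the byte offset at which each line starts
    offsets = []
    with open(infilename, 'rb') as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)

    # Shuffle the offsets instead of the lines themselves
    random.shuffle(offsets)
    half = len(offsets) // 2

    # Pass 2: seek to each chosen offset and copy that single line
    with open(infilename, 'rb') as src:
        for outname, chosen in ((outfilename1, offsets[:half]),
                                (outfilename2, offsets[half:])):
            with open(outname, 'wb') as out:
                for offset in chosen:
                    src.seek(offset)
                    line = src.readline()
                    # the last line of the input may lack a newline
                    out.write(line if line.endswith(b'\n') else line + b'\n')
```

The random seeks make this slower than the in-memory version, so it is only worth considering when the file really does not fit in memory.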

Update: changed / to // to avoid problems when __future__.division is enabled.
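To illustrate why this matters: under true division (the default in Python 3, or Python 2 with `from __future__ import division`), `len(lines) / 2` is a float, and a float cannot be used as a slice index. A small sketch with made-up sample data:

```python
lines = ['1\n', '2\n', '3\n', '4\n', '5\n']

# Floor division always gives an int, so it is safe as a slice index
half = len(lines) // 2
first, second = lines[:half], lines[half:]

# True division gives 2.5 here, and slicing with a float raises TypeError
try:
    lines[:len(lines) / 2]
    float_slice_ok = True
except TypeError:
    float_slice_ok = False
```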

Answer 3

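import random

# Read all lines, then shuffle them in place
with open("file") as f:
    data = f.readlines()
random.shuffle(data)

c = 1
f = open("test." + str(c), "w")
for n, line in enumerate(data):
    # Switch to the second output file at the halfway point;
    # // keeps the comparison on integers under true division
    if n == len(data) // 2:
        c += 1
        f.close()
        f = open("test." + str(c), "w")
    f.write(line)
f.close()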

Answer 4

Another version:

from random import shuffle

def shuffle_split(infilename, outfilename1, outfilename2):
    with open(infilename, 'r') as f:
        lines = f.read().splitlines()

    shuffle(lines)
    half_lines = len(lines) // 2

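    with open(outfilename1, 'w') as f:
        # pop() takes lines off the end of the shuffled list;
        # splitlines() dropped the newlines, so add one back per line
        f.writelines(lines.pop() + '\n' for _ in range(half_lines))
    with open(outfilename2, 'w') as f:
        # whatever remains goes to the second file
        f.writelines(line + '\n' for line in lines)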
