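I am trying to split two huge source files (whose lines correspond to each other one-to-one) into several smaller files, each containing unique entries, but so far the results have been off. My idea is a function that reads every file in the output directory and adds its contents to a kind of blacklist. At first this blacklist is empty because the files are empty, so I read the source file, copy n lines into the first smaller file, and add that content to the blacklist. Next, I go through the source again and write n lines into the second file, as long as they are not in said blacklist. For some reason, after I append the content from the first read, I don't get any input for the blacklist.
This is what I've got: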
import os

def check_overlap(path):
    # check if lines appear in other files
    content = []
    for filename in os.listdir(path):
        with open(os.path.join(path, filename), "r", encoding="utf-8") as f:
            content.append(f.read())
            print(filename + str(content))
            # when I print this out, it's empty for the first file
            # the other 3 files have the desired output, but why?
            # How is it empty after I appended the content of f?
    all_content = "".join(content)
    return all_content

def shuffle_data(n, source, output):
    # shuffle source into n portions while keeping each line unique
    with open(output, "w", encoding="utf-8") as shuffled_file:
        existing_files = check_overlap()
        with open(source, 'r', encoding="utf-8") as source:
            i = 0
            for line in source:
                if i < n and line not in existing_files:
                    shuffled_file.write(line)
                    i += 1

shuffle_data(50, "source1", "output_50A")
shuffle_data(50, "source2", "output_50B")
shuffle_data(200, "source1", "output_200A")
shuffle_data(200, "source2", "output_200B")
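This also means my overall output is wrong. The sources look like this:
File 1    File 2
dog       dogs
book      books
horse     horses
flower    flowers
egg       eggs
They have to keep their corresponding lines, but because of the error I get:
Output 1  Output 2
dog       dogs
book      books
horse     flowers
flowers   eggs
So it seems to be skipping random lines because the blacklist is unreliable. The sources are shuffled every time the program runs, so they always differ in which line they start to diverge at. All output files are in the same directory; the sources are in a different one.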
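Two general Python behaviours seem relevant to the symptoms described above, and the sketch below illustrates them in isolation. First, opening a file with mode "w" truncates it immediately, so a directory scan made while that file is open for writing sees it as empty. Second, `in` on a joined string is a substring test, not a whole-line test. The helper `read_lines`, the temporary directory, and the sample words are hypothetical and only serve the illustration; this is not a diagnosis of the exact code above.

```python
import os
import tempfile


def read_lines(directory):
    """Collect every line of every file in `directory` into a set (hypothetical helper)."""
    seen = set()
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename), encoding="utf-8") as f:
            seen.update(line.rstrip("\n") for line in f)
    return seen


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "output_50A")
    with open(target, "w", encoding="utf-8") as f:
        f.write("dogs\nbook\n")

    # The file has been written and closed, so its lines are visible.
    before = read_lines(tmp)

    # Re-opening with "w" truncates the file at once, before anything is written.
    with open(target, "w", encoding="utf-8") as f:
        during = read_lines(tmp)

# Substring test on a joined string versus membership test on a set of lines.
joined = "notebook\n"
substring_hit = "book\n" in joined      # matches the tail of "notebook\n"
set_hit = "book" in {"notebook"}        # whole-line comparison only
```

Keeping the blacklist as a set of whole lines, and building it before the output file is opened for writing, would avoid both effects.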
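Answer 0: (score: 0)
Based on the comments, try something like this. You may need to handle the exception raised if the source files run out before the sizes do.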
Sizes = [50, 100, 200, 600, 1000, 3000, 10000]

with open('file1') as f1:
    with open('file2') as f2:
        # pair up corresponding lines so both outputs stay in step
        sources = iter(zip(f1, f2))
        for size in Sizes:
            o1_name = 'output_{}A'.format(size)
            o2_name = 'output_{}B'.format(size)
            with open(o1_name, 'w') as o1:
                with open(o2_name, 'w') as o2:
                    for _ in range(size):
                        l1, l2 = next(sources)
                        # strip() drops the newline, so add it back
                        o1.write(l1.strip() + '\n')
                        o2.write(l2.strip() + '\n')
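As noted above, `next(sources)` raises `StopIteration` once the shorter source is exhausted. One possible way to handle that, sketched here with `itertools.islice`, is to take at most `size` pairs per chunk and stop when nothing is left. The function name `split_parallel`, its `prefix` parameter, and the choice to write a short final chunk rather than discard it are assumptions made for illustration, not part of the original answer.

```python
from itertools import islice


def split_parallel(path_a, path_b, sizes, prefix="output"):
    """Write consecutive chunks of paired lines; return the number of pairs per chunk."""
    written = []
    with open(path_a, encoding="utf-8") as fa, open(path_b, encoding="utf-8") as fb:
        pairs = zip(fa, fb)  # a single shared iterator, so no pair is ever reused
        for size in sizes:
            chunk = list(islice(pairs, size))  # at most `size` pairs, never raises
            if not chunk:
                break  # sources exhausted: stop instead of creating empty files
            name_a = "{}_{}A".format(prefix, size)
            name_b = "{}_{}B".format(prefix, size)
            with open(name_a, "w", encoding="utf-8") as oa, \
                    open(name_b, "w", encoding="utf-8") as ob:
                for line_a, line_b in chunk:
                    oa.write(line_a.rstrip("\n") + "\n")
                    ob.write(line_b.rstrip("\n") + "\n")
            written.append(len(chunk))
    return written
```

Calling `split_parallel('file1', 'file2', Sizes)` would mirror the loop above. Because every chunk is drawn from the same iterator, the output files cannot overlap, so no blacklist is needed.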