Question

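I have a huge list of files (20k). Each file has a unique identifier string on its first line, and the first line contains nothing but this identifier. The list covers roughly n different identifiers, and each identifier has at least 500 files (though the number of files per identifier varies).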

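I need to randomly sample 500 files per identifier and copy them to another directory, so that I end up with a subset of the original list in which every identifier is represented by the same number of files.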

I know that random.sample() can give me a random list, but it does not take the first-line constraint into account, and shutil.copy() can copy the files...

But how do I do this (efficiently) in Python while respecting the identifier constraint in each file's first line?

Answer 1

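From what you describe, you have to read the first line of every file in order to group the files by identifier. Something like this should, I think, do what you want: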

import os
import collections
import random
import shutil

def get_identifier(path):
    with open(path) as fd:
        return fd.readline().strip()       #assuming you don't want the \n in the identifier

paths = ['/home/file1', '/home/file2', '/home/file3']
destination_dir = '/tmp'
identifiers = collections.defaultdict(list)
for path in paths:
    identifier = get_identifier(path)
    identifiers[identifier].append(path)

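# Use a separate name here so the outer `paths` list is not shadowed.
for identifier, group_paths in identifiers.items():
    # random.sample raises ValueError if a group has fewer than 500 files
    sample = random.sample(group_paths, 500)
    for path in sample:
        file_name = os.path.basename(path)
        destination = os.path.join(destination_dir, file_name)
        shutil.copy(path, destination)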
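
The snippet above hardcodes three example paths. As a rough sketch only, the same grouping and sampling could be wrapped in a function that collects the paths from a single source directory. The names `first_line`, `sample_by_identifier`, `source_dir`, `k` and `seed` are made up for illustration and are not part of the question. It assumes all files sit directly in one directory (so base names cannot collide in the destination) and are readable as text.

```python
import collections
import os
import random
import shutil


def first_line(path):
    """Return the first line of a file without surrounding whitespace."""
    with open(path) as fd:
        return fd.readline().strip()


def sample_by_identifier(source_dir, destination_dir, k=500, seed=None):
    """Copy k randomly chosen files per identifier into destination_dir.

    Returns a dict mapping each identifier to the copied file names.
    """
    rng = random.Random(seed)  # optional seed makes the choice repeatable

    # Group every regular file in source_dir by its first line.
    groups = collections.defaultdict(list)
    with os.scandir(source_dir) as entries:
        for entry in entries:
            if entry.is_file():
                groups[first_line(entry.path)].append(entry.path)

    # Check all groups before copying anything, so a too-small group
    # does not leave a half-filled destination directory behind.
    for identifier, group_paths in groups.items():
        if len(group_paths) < k:
            raise ValueError(
                f"identifier {identifier!r} has only {len(group_paths)} files"
            )

    os.makedirs(destination_dir, exist_ok=True)
    copied = {}
    for identifier, group_paths in groups.items():
        # Sort first so that a fixed seed gives the same sample each run.
        chosen = rng.sample(sorted(group_paths), k)
        for path in chosen:
            shutil.copy(path, os.path.join(destination_dir, os.path.basename(path)))
        copied[identifier] = [os.path.basename(path) for path in chosen]
    return copied
```

A call might look like `sample_by_identifier('/home/files', '/tmp/subset', k=500)`, where both directory names are placeholders. Every file is opened once, but only its first line is read, so for 20k files the cost is mostly the copying itself.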
