Question

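I have a directory containing millions of items on a fairly slow disk. I want to randomly sample 100 of those items, and I'd like to do it with a glob.

One way is to glob every file in the directory and then sample from the result: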

import glob, random

files = sorted(glob.glob('*.xml'))
# Sample the file names themselves rather than list indices.
random_files = random.sample(files, 100)

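But this is really slow, because I have to build one big list of millions of files, and that takes a lot of disk crawling.

Is there a faster way to do this that doesn't do so much work? It doesn't have to be a perfectly distributed sample, or even exactly 100 items, as long as it's fast.

I'm thinking:

Maybe we can use the inodes to be faster?
Maybe we can select items without knowing the full contents on disk?
Maybe there's some shortcut that makes this faster.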

Answer 1

Use os.listdir instead of glob. In the timing below it was roughly twice as fast.

import glob, os, random, time

# Time glob.glob: build and sort the full list of matches, then sample 100.
t = time.time()
files = sorted(glob.glob('tmp/*'))
file_count = len(files)
random_files = random.sample(files, 100)
t = time.time() - t
print("glob.glob: %.4fs, %d files found" % (t, file_count))

# Time os.listdir on the same directory, doing the same work.
t = time.time()
files = sorted(os.listdir('tmp/'))
file_count = len(files)
random_files = random.sample(files, 100)
t = time.time() - t
print("os.listdir: %.4fs, %d files found" % (t, file_count))

Output

glob.glob: 0.6782s, 124729 files found
os.listdir: 0.3183s, 124778 files found

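Note: if you know something about the file names that lets you generate them at random, that would be the best option. Alternatively, if you can rename the files into a format that lends itself to random sampling, that would work too.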
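
As a rough sketch of the name-generation idea: suppose, purely as an assumption (nothing in the question says so), that the files are named with zero-padded sequence numbers such as 0000042.xml. Candidate names can then be guessed and checked one at a time, with no directory listing at all. The helper name sample_by_name, the naming pattern, and the max_id bound are all hypothetical.

```python
import os
import random

def sample_by_name(directory, max_id, k, pattern="{:07d}.xml", max_tries=None):
    """Pick up to k existing files by guessing names, without listing the directory.

    Assumes (hypothetically) that files are named pattern.format(i) for
    integers 0 <= i < max_id. Gaps in the numbering are tolerated.
    """
    if max_tries is None:
        max_tries = 20 * k  # give up eventually if most guesses miss
    chosen = set()
    for _ in range(max_tries):
        if len(chosen) >= k:
            break
        name = pattern.format(random.randrange(max_id))
        # One existence check per guess instead of a full directory scan.
        if name not in chosen and os.path.exists(os.path.join(directory, name)):
            chosen.add(name)
    return sorted(chosen)
```

Each guess costs a single path lookup, so the total work depends on the sample size rather than on the number of files. If the numbering is sparse, many guesses will miss and the function may return fewer than k names.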

Answer 2

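Maybe we can use the inodes to be faster?

Not inodes, but directory entries. You don't want to call stat() on every file.

Maybe we can select items without knowing the full contents on disk?

Yes, that's the plan. Open the directory, read the directory entries, sample the few you need out of the millions, and only then fetch the files.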
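
One possible reading of that plan in Python is a single-pass reservoir sample over the directory entries. Reservoir sampling is a technique swapped in here for illustration; the answer does not name it. The helper reservoir_sample is hypothetical, and the with form of os.scandir assumes Python 3.6 or later.

```python
import fnmatch
import os
import random

def reservoir_sample(directory, k, pattern="*"):
    """Return up to k entry names matching pattern, chosen uniformly at random.

    Makes one pass over the directory entries (Algorithm R), so it never
    holds more than k names in memory and never sorts or stat()s anything.
    """
    sample = []
    seen = 0  # number of matching entries read so far
    with os.scandir(directory) as entries:
        for entry in entries:
            if not fnmatch.fnmatch(entry.name, pattern):
                continue
            seen += 1
            if len(sample) < k:
                # Fill the reservoir with the first k matches.
                sample.append(entry.name)
            else:
                # Keep the new entry with probability k / seen.
                j = random.randrange(seen)
                if j < k:
                    sample[j] = entry.name
    return sample
```

For example, reservoir_sample('.', 100, '*.xml') would stand in for the glob in the question. This still reads every directory entry once, but it avoids building and sorting a list of millions of names. Unlike glob, fnmatch does not treat names that start with a dot specially.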

In C, this would be a pair of opendir()/readdir() calls.

In Python, the equivalent is scandir, which is part of the standard library as os.scandir from Python 3.5 onward. On older versions, get it from https://github.com/benhoyt/scandir
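
Since the question says the sample does not need to be perfectly distributed, a cheaper variant is to stop reading after a bounded number of entries and sample from those. This is only a sketch: the helper sample_from_first_entries and its read_limit parameter are hypothetical, and the with form of os.scandir assumes Python 3.6 or later.

```python
import itertools
import os
import random

def sample_from_first_entries(directory, k, read_limit=10_000):
    """Sample k names from only the first read_limit directory entries.

    Stops reading early, so the cost is bounded by read_limit rather than by
    the directory size. The result is NOT a uniform sample of the whole
    directory: entries beyond read_limit can never be chosen.
    """
    with os.scandir(directory) as entries:
        # islice pulls at most read_limit entries from the iterator.
        head = [entry.name for entry in itertools.islice(entries, read_limit)]
    return random.sample(head, min(k, len(head)))
```

The order in which os.scandir yields entries is arbitrary and depends on the file system, so the first entries are not random in any statistical sense. Whether that bias is acceptable depends on what the sample is used for.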

Update

Link to the Open Group docs on opendir()/readdir(): http://pubs.opengroup.org/onlinepubs/009695399/functions/opendir.html

In Python
