Question

What is the fastest way to get the file name of a specific file using Python (without first loading the whole file list)?

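I have directories containing thousands of files, and I need to access specific files in those directories. Specifically, the file I need is, for example, the 1000th in the file listing. I want to do this without reading all the files and then picking the one I want. Is there a way to specify the index of a file (e.g., the 1000th listed in the directory) and have Python (or the OS) return only the name of that particular file?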

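I need to do this again and again for different files in different directories, so I don't want to load all the files in each directory because it would take too long.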

Thanks in advance.

Answer 1

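You can't access the 1000th file without iterating past the first 999, but you don't need to iterate the whole directory if you have Python 3.5 or later, which added os.scandir (or if you install the third-party scandir package for older versions of Python).

Combined with islice, you can easily skip the first 999 entries: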

import itertools
import os

# Will raise StopIteration if you don't have 1000 files
file1000 = next(itertools.islice(os.scandir(somedir), 999, None)).path
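The one-liner above raises StopIteration when the directory holds fewer entries than requested. A minimal sketch of a wrapper that returns a fallback value instead, assuming Python 3.6+ (where os.scandir can be used as a context manager); the name nth_entry and its default parameter are illustrative, not part of any library:

```python
import itertools
import os


def nth_entry(directory, n, default=None):
    """Return the path of the n-th entry (1-based) in raw directory order.

    Returns `default` if the directory holds fewer than n entries.
    """
    # The context manager closes the directory handle promptly,
    # even though iteration stops early.
    with os.scandir(directory) as entries:
        entry = next(itertools.islice(entries, n - 1, None), None)
    return default if entry is None else entry.path
```

Calling nth_entry(somedir, 1000) would then return either the path of the 1000th entry or None, which is easier to handle when looping over many directories.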

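Note that a directory's ordering of files is not necessarily sorted by timestamp, name, or anything else meaningful, so odds are the 1000th entry is unpredictable. You may want to find a way to identify the right file by name, rather than scanning for it in an arbitrary listing order.

If you really do need the Xth entry in some order other than the natural iteration order, you need to iterate the whole directory to sort it, but os.scandir can still save you some work. It is usually faster than os.listdir, and depending on the OS it may provide some stat information "for free", avoiding a separate stat call per file. For example, as you mentioned in the comments, you want to sort by timestamp, and you probably want to skip directories and count only files: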

from operator import methodcaller

# Only count files for finding entry #1000
filesonly = filter(methodcaller('is_file'), os.scandir(somedir))

# Sort by time, and keep the thousandth
# On Windows, you may want st_ctime instead of st_mtime
# Raises IndexError if < 1000 files in dir
file1000 = sorted(filesonly, key=lambda x: x.stat().st_mtime)[999].path

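Replacing sorted with heapq.nsmallest can reduce the peak memory cost somewhat. It is a bit slower if the number of items you retrieve is a large fraction of the total input, but it bounds memory usage (and can be faster if the directory contains millions of files and you only need #1000):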

from heapq import nsmallest

# Get the 1000th file, never storing info on more than 1000 at a time
# (filesonly is a one-shot iterator; recreate it if sorted() above already consumed it)
file1000 = nsmallest(1000, filesonly, key=lambda x: x.stat().st_mtime)[999].path
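As a quick sanity check that the two approaches pick the same element, here is a small sketch using made-up modification times as stand-in data rather than real DirEntry objects (the list mtimes and the variable names are assumptions for illustration only):

```python
import random
from heapq import nsmallest

# Stand-in data: hypothetical modification times, not real directory entries
mtimes = [random.random() for _ in range(10_000)]
n = 1000

# Full sort: holds every item in memory, then indexes the n-th
via_sorted = sorted(mtimes)[n - 1]

# Bounded heap: keeps at most n items at a time; also accepts a generator
via_heap = nsmallest(n, (t for t in mtimes))[n - 1]
```

Both expressions select the n-th smallest value; only the amount of data held at once differs.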

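You can't avoid some processing here, but compared to a solution that isn't based on scandir, this may greatly reduce the memory overhead and the per-file stat overhead.

Based on your comments, it seems you actually want the 1000th file in alphabetical order, not by modification time or directory order (the ls command sorts alphabetically automatically; you only see the true directory order by running /bin/ls -U). You also seem to care only about files ending in .fits, and want only files, not directories. In that case, the complete solution is: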

from operator import attrgetter

# Keep only files with matching extension
filesonly = (e for e in os.scandir(somedir) if e.is_file() and e.name.endswith('.fits'))

# Keep the "smallest" 1000 entries sorted alphabetically by name
# then pull off the 1000th entry
# End with .name instead of .path if you don't need the whole path
file1000 = nsmallest(1000, filesonly, key=attrgetter('name'))[999].path
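Since the question involves repeating this for many directories, the snippet above could be wrapped in a reusable function. The following is only a sketch under assumptions: the name nth_file_by_name and its suffix parameter are hypothetical, and it requires Python 3.6+ for the os.scandir context manager.

```python
import os
from heapq import nsmallest
from operator import attrgetter


def nth_file_by_name(directory, n, suffix='.fits'):
    """Return the path of the n-th regular file (1-based, alphabetical by
    name) whose name ends with `suffix`.

    Raises IndexError if fewer than n matching files exist.
    """
    with os.scandir(directory) as entries:
        # Keep only regular files with the wanted extension
        matches = (e for e in entries
                   if e.is_file() and e.name.endswith(suffix))
        # Track only the n alphabetically smallest entries
        smallest = nsmallest(n, matches, key=attrgetter('name'))
    if len(smallest) < n:
        raise IndexError('fewer than %d matching files in %s' % (n, directory))
    return smallest[n - 1].path
```

A call such as nth_file_by_name(somedir, 1000) still scans every entry once, but it never holds more than n entries in memory per directory.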

Answer 2

If you have Python 3, you can do it this way using subprocess. It only works on Linux.

import subprocess

my_dir = r"/foo/bar/"   # Assign your directory path here (keep the trailing slash)
extension = r'*.fits'   # Glob pattern for the files to search for
nth_file = '1000'       # nth file in the directory order

# If you want the files sorted by timestamp, replace 'ls -1U' with 'ls -1t'
cmd1 = r'ls -1U ' + my_dir + extension   # ls -1U /foo/bar/*.fits
cmd2 = r'sed "' + nth_file + 'q;d"'      # sed "1000q;d" prints only line 1000

ls_output = subprocess.Popen(cmd1, shell=True, universal_newlines=True, stdout=subprocess.PIPE)
final_output = subprocess.Popen(cmd2, shell=True, universal_newlines=True, stdin=ls_output.stdout, stdout=subprocess.PIPE)
# Strip the trailing newline left by sed
req_file_path = final_output.communicate()[0].strip()

# Retrieve only the filename from the full path
index = req_file_path.rfind('/')
file_name = req_file_path[index+1:]

print(file_name)
