Question

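I have a folder containing 1 million files.

I would like to begin processing immediately while the files in this folder are being listed, in Python or another scripting language.

The usual functions (os.listdir in Python, ...) are blocking, so my program has to wait for the whole listing to finish, which can take a long time.

What's the best way to list huge folders?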

Answer 1

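If it's convenient, change your directory structure; if not, you can use ctypes to call opendir and readdir.

Here is a copy of that code; all I did was indent it properly, add the try/finally block, and fix a bug. You might have to debug it, particularly the struct layout.

Note that this code is not portable. You would need to use different functions on Windows, and I think the structs vary from Unix to Unix.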

#!/usr/bin/python
"""
An equivalent os.listdir but as a generator using ctypes
"""

from ctypes import CDLL, c_char_p, c_int, c_long, c_ushort, c_byte, c_char, Structure, POINTER
from ctypes.util import find_library

class c_dir(Structure):
    """Opaque type for directory entries, corresponds to struct DIR"""
    pass
c_dir_p = POINTER(c_dir)

class c_dirent(Structure):
    """Directory entry"""
    # FIXME not sure these are the exactly correct types!
    _fields_ = (
        ('d_ino', c_long), # inode number
        ('d_off', c_long), # offset to the next dirent
        ('d_reclen', c_ushort), # length of this record
        ('d_type', c_byte), # type of file; not supported by all file system types
        ('d_name', c_char * 4096) # filename
        )
c_dirent_p = POINTER(c_dirent)

c_lib = CDLL(find_library("c"))
opendir = c_lib.opendir
opendir.argtypes = [c_char_p]
opendir.restype = c_dir_p

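# NOTE readdir_r was once suggested here, but it is deprecated in current glibc;
# readdir is fine as long as a DIR stream is not shared between threads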
readdir = c_lib.readdir
readdir.argtypes = [c_dir_p]
readdir.restype = c_dirent_p

closedir = c_lib.closedir
closedir.argtypes = [c_dir_p]
closedir.restype = c_int

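def listdir(path):
    """
    A generator to return the names of files in the directory passed in
    """
    # opendir() takes a byte string, so encode text paths (needed on Python 3)
    if not isinstance(path, bytes):
        path = path.encode()
    dir_p = opendir(path)
    if not dir_p:
        # opendir() returns NULL on failure; calling readdir(NULL) would crash
        raise OSError("opendir() failed for %r" % path)
    try:
        while True:
            p = readdir(dir_p)
            if not p:
                break
            name = p.contents.d_name  # a byte string
            if name not in (b".", b".."):
                yield name
    finally:
        closedir(dir_p)

if __name__ == "__main__":
    for name in listdir("."):
        print(name)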
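
On the first suggestion (changing the directory structure), one common layout spreads files over nested subdirectories chosen from a hash of the file name, so that no single directory grows huge. The sketch below is only an illustration under that assumption; the helper name `sharded_path` and its `levels`/`width` parameters are made up here and are not part of the original answer.

```python
import hashlib
import os

def sharded_path(root, filename, levels=2, width=2):
    """Return root/ab/cd/filename, where 'abcd...' is a hash of the filename."""
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    # Take `levels` consecutive slices of `width` hex characters as subdirectories
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *(parts + [filename]))
```

With the defaults there are 256 x 256 = 65,536 leaf directories, so a million files would average roughly 15 per directory.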

Answer 2

This feels dirty but should do the trick:

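import subprocess

def listdirx(dirname='.', cmd='ls'):
    # Note: plain `ls` sorts its output, so it reads the whole directory
    # before printing anything; GNU `ls -U` skips the sort.
    proc = subprocess.Popen([cmd, dirname], stdout=subprocess.PIPE,
                            universal_newlines=True)  # text output, not bytes
    # readline() returns '' at end of output, which ends the loop
    for filename in iter(proc.stdout.readline, ''):
        yield filename.rstrip('\n')
    proc.communicate()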

Usage: listdirx('/something/with/lots/of/files')
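
A variant of the same idea, sketched under the assumption that an `ls` supporting `-f` is on the PATH: `-f` asks `ls` to emit entries in directory order without sorting, which also turns on `-a`, so "." and ".." have to be filtered out. The name `listdir_unsorted` is made up for this example. How promptly names arrive still depends on the `ls` implementation and on pipe buffering, and file names containing newlines will be split.

```python
import subprocess

def listdir_unsorted(dirname="."):
    """Yield names from `ls -f`, which lists entries without sorting them."""
    proc = subprocess.Popen(
        ["ls", "-f", dirname],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # decode the output to text
    )
    try:
        for line in proc.stdout:
            name = line.rstrip("\n")
            # -f implies -a, so skip the "." and ".." entries ourselves
            if name not in (".", ".."):
                yield name
    finally:
        # Runs on normal exhaustion and when the caller stops early
        proc.stdout.close()
        proc.wait()
```

Usage is the same as above: for name in listdir_unsorted('/something/with/lots/of/files'): ...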

Answer 3

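For those coming in from Google: PEP 471 added a proper solution to the Python 3.5 standard library, and it was backported to Python 2.6+ and 3.2+ as the scandir module on PyPI.

Source: https://stackoverflow.com/a/34922054/435253

Python 3.5+:

os.walk has been updated to use this infrastructure for better performance.
os.scandir returns an iterator over DirEntry objects.

Python 2.6/2.7 and 3.2/3.3/3.4:

scandir.walk is a more performant version of os.walk
scandir.scandir returns an iterator over DirEntry objects.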
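
A minimal sketch of what this looks like in practice, assuming Python 3.6+ (where the scandir() iterator can be used as a context manager). The helper name `iter_file_names` is made up for illustration. The loop body runs as entries are read, without a full list being built first.

```python
import os

def iter_file_names(path="."):
    """Yield entry names one at a time instead of building a full list."""
    # os.scandir() is lazy: entries are fetched as the loop advances.
    with os.scandir(path) as entries:
        for entry in entries:
            yield entry.name
```

A caller can start work on the first name right away, for example: for name in iter_file_names('/something/with/lots/of/files'): process(name), where process stands for whatever per-file work is needed.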

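The scandir() iterators wrap opendir/readdir on POSIX platforms and FindFirstFileW/FindNextFileW on Windows.

The point of returning DirEntry objects is to allow metadata to be cached, to minimize the number of system calls made. (E.g. DirEntry.stat(follow_symlinks=False) never makes a system call on Windows, because the FindFirstFileW and FindNextFileW functions provide the stat information for free.)

Source: https://docs.python.org/3/library/os.html#os.scandir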
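
To illustrate the metadata point, here is a small sketch (again assuming Python 3.6+; `iter_regular_files` is a hypothetical helper name) that filters on file type and reads the size through the DirEntry rather than through separate os.path calls. On POSIX the first stat() call on an entry typically still costs a system call; the result is then cached on the entry.

```python
import os

def iter_regular_files(path="."):
    """Yield (name, size_in_bytes) for regular files directly under path."""
    with os.scandir(path) as entries:
        for entry in entries:
            # is_file() can usually answer from data cached during the scan.
            if entry.is_file(follow_symlinks=False):
                # stat() results are cached on the DirEntry after the first call.
                info = entry.stat(follow_symlinks=False)
                yield entry.name, info.st_size
```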

Answer 4

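Here is an answer on how to traverse a large directory file by file on Windows!

I searched like a maniac for a Windows DLL that would let me do what is done on Linux, but no luck.

So, I concluded that the only way was to create my own DLL that would expose those static functions to me, but then I remembered pywintypes. And, yeah! This is already done there. And, even better, an iterator function is already implemented! Cool!

A Windows DLL with FindFirstFile(), FindNextFile() and FindClose() may still be out there somewhere, but I didn't find it. So, I used pywintypes.

EDIT: They were hiding in kernel32.dll. Please see ssokolow's answer, and my comment on it.

Sorry for the dependency. But I think you can extract win32file.pyd from the ...\site-packages\win32 folder, along with any dependencies, and distribute it with your program independently of win32types if you have to.

I found this question while searching for how to do this, along with some others.

Here:

How to copy first 100 files from a directory of thousands of files using python?

There I posted the full code, with the Linux version of listdir() from here (by Jason Orendorff) and with the Windows version that I present here.

So anyone wanting a more or less cross-platform version, go there or combine the two answers yourself.

EDIT: Or better still, use the scandir module or os.scandir() (in Python 3.5 and later versions). It handles errors and some other things better as well.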

from win32file import FindFilesIterator
import os

def listdir(path):
    """
    A generator to return the names of files in the directory passed in
    """
    if "*" not in path and "?" not in path:
        st = os.stat(path) # Raise an error if dir doesn't exist or access is denied to us
        # Check if we got a dir or something else!
        # Check gotten from stat.py (for fast checking):
        if (st.st_mode & 0o170000) != 0o040000:
            # errno 20 is ENOTDIR
            raise OSError(20, "Not a directory", path)
        path = path.rstrip("\\/")+"\\*"
    # Else:  Decide that user knows what she/he is doing
    for file in FindFilesIterator(path):
        name = file[-2]
        # Unfortunately, only drives (eg. C:) don't include "." and ".." in the list:
        if name == "." or name == "..": continue
        yield name

List files in a folder as a stream to begin processing immediately
