Question

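I have a function that walks a directory tree and searches for files of a specified file type. It works well; the only problem I have is that it can be very slow. Can anyone offer more pythonic suggestions to speed up the process: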

import os

def findbyfiletype(filetype, directory):
    """

    findbyfiletype allows the user to search by two parameters, filetype and directory.

    Example:
        If the user wishes to locate all pdf files within a directory including subdirectories
        then the function would be called as follows:

        findbyfiletype(".pdf", "D:\\\\")

        this will return a dictionary of strings where the filename is the key and the file path is the value
        e.g.
            {'file.pdf':'c:\\folder\\file.pdf'}


        note that both parameters filetype and directory must be enclosed in string double or single quotes
        and the directory parameter must use the backslash escape \\\\  as opposed to \ as python will throw a string literal error
    """

    indexlist = []                      # holds all files in the given directory including sub folders
    FiletypeFilenameList = []           # holds list of all filenames of defined filetype in indexlist
    FiletypePathList = []               # holds path names to individual files of defined filetype

    for root, dirs, files in os.walk(directory):
        for name in files:
            indexlist.append(os.path.join(root, name))
            if filetype in name[-5:]:
                FiletypeFilenameList.append(name)

    for files in indexlist:
        if filetype in files[-5:]:
            FiletypePathList.append(files)

    FileDictionary = dict(zip(FiletypeFilenameList, FiletypePathList))
    del indexlist, FiletypePathList, FiletypeFilenameList

    return FileDictionary

OK, this is what I ended up with, using a combination of the suggestions from @Ulrich Eckhardt, @Anton and @Cox:

import os
import scandir

def findbyfiletype (filetype, directory):
    FileDictionary={}

    for root, dirs, files in scandir.walk(directory):
        for name in files:
            if name.endswith(filetype):
                FileDictionary[name] = os.path.join(root, name)

    return FileDictionary

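As you can see, it has been refactored to get rid of the unnecessary lists and to build the dictionary in one step. @Anton, your suggestion of the scandir module reduced the time by about 97% in one instance, which nearly blew my mind.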
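For anyone who wants to check a speed-up like this on their own tree, here is a minimal timing sketch. The helper name time_walk is hypothetical and not part of the code above; it assumes only that the function passed in behaves like os.walk and yields (root, dirs, files) tuples.

```python
import os
import time


def time_walk(walk, directory):
    """Return (seconds, file_count) for one full traversal using `walk`.

    `walk` is any os.walk-compatible callable; this helper is illustrative only.
    """
    start = time.perf_counter()
    # Count files so the generator is fully consumed before the clock stops.
    count = sum(len(files) for _root, _dirs, files in walk(directory))
    return time.perf_counter() - start, count
```

Calling time_walk(os.walk, path) and then time_walk(scandir.walk, path) gives two numbers to compare. A repeated run over the same tree is often faster than the first one, so it is probably fairer to time each variant more than once.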

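I am marking @Anton's answer as accepted because it sums up everything I actually achieved with the refactoring, but @Ulrich Eckhardt and @Cox both get upvotes because you were both very helpful.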

Regards

Answer 1

You can use the faster scandir module (PEP 471) instead of os.walk().
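On Python 3.5 and later the functionality from PEP 471 ships in the standard library as os.scandir(), and os.walk() is itself built on it, so the separate module is mainly needed on older interpreters. A minimal sketch of a recursive traversal with os.scandir() follows; the helper name iter_files is made up for illustration.

```python
import os


def iter_files(directory):
    """Yield the path of every regular file below `directory` (illustrative helper)."""
    with os.scandir(directory) as entries:
        for entry in entries:
            # A DirEntry can usually answer is_dir()/is_file() from information
            # the directory listing already returned, avoiding extra stat calls.
            if entry.is_dir(follow_symlinks=False):
                yield from iter_files(entry.path)
            elif entry.is_file():
                yield entry.path
```

Because it is a generator, the caller can filter paths as they arrive instead of first building a full list.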

Also, a few other tips:

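Don't use the arbitrary [-5:]. Use the endswith() string method or os.path.splitext().
Don't build two long lists and then make a dictionary from them. Build the dictionary directly.
If escaping backslashes bothers you, use forward slashes, e.g. 'c:/folder/file.pdf'. They just work.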
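The first two tips can be sketched together as below. This is only an illustration: the function name find_by_extension is hypothetical, and the case-insensitive comparison is an assumption of mine, not something the question asked for.

```python
import os


def find_by_extension(extension, directory):
    """Map file name -> full path for files with the given extension (illustrative)."""
    wanted = extension.lower()
    matches = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            # splitext("report.PDF") gives ("report", ".PDF"), so the comparison
            # looks at the real extension instead of a fixed 5-character slice.
            if os.path.splitext(name)[1].lower() == wanted:
                matches[name] = os.path.join(root, name)
    return matches
```

A call such as find_by_extension(".pdf", "D:/") uses a forward slash, so no escaping is needed. The slice test also has false positives: ".pdf" in "a.pdfx"[-5:] is true, while "a.pdfx".endswith(".pdf") is false. As in the original function, two files with the same name in different folders overwrite each other in the dictionary.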

Answer 2

walk() can be slow because it tries to cover a lot of cases. I use a simple variant:

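def walk(self, path):
    try:
        # join each entry name with its parent to get a full path
        entries = (os.path.join(path, x) for x in os.listdir(path))
        for x in entries:
            if os.path.isdir(x):
                self.walk(x)  # recurse into subdirectories
            elif x.endswith(("jpg", "png", "jpeg")):
                self.lf.append(x)  # self.lf collects the matching paths
    except PermissionError:
        pass  # skip directories we are not allowed to read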

It is fast, and the file system gets cached after the first pass, so a second call is even faster.

PS: the walk function is a member of a class, obviously; that is why "self" is there.

Edit: on NTFS, don't bother with islink. Updated with try/except.

But this simply ignores the directories you don't have permission for. If you want them listed, you must run the script as administrator.
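If silently skipping unreadable directories is a concern, one option is to record them instead. The sketch below is not the method from this answer: it swaps in the onerror callback of the standard os.walk(), and the name walk_with_skipped is hypothetical.

```python
import os


def walk_with_skipped(directory):
    """Return (file_paths, skipped), where skipped lists directories os.walk could not read."""
    skipped = []

    def remember(error):
        # os.walk passes the OSError it hit; .filename is the offending path.
        skipped.append(error.filename)

    paths = [
        os.path.join(root, name)
        for root, _dirs, files in os.walk(directory, onerror=remember)
        for name in files
    ]
    return paths, skipped
```

The skipped list then shows whether rerunning with elevated rights would actually find more files.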
