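Python 2.7.5, Windows / Mac.
I'm trying to find the best way to search for files (more than 10,000) across multiple storage units (about 128 TiB). The files have specific extensions, and I can ignore some folders.
Here is my first function, using os.listdir and recursion: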
import os

count = 0

def SearchFiles1(path):
    global count
    pathList = os.listdir(path)
    for i in pathList:
        subPath = path+os.path.sep+i
        if os.path.isfile(subPath) == True :
            fileName = os.path.basename(subPath)
            extension = fileName[fileName.rfind("."):]
            if ".ext1" in extension or ".ext2" in extension or ".ext3" in extension:
                count += 1
                #do stuff . . .
        else :
            if os.path.isdir(subPath) == True:
                if not "UselessFolder1" in subPath and not "UselessFolder2" in subPath:
                    SearchFiles1(subPath)
It works, but I think it could be better (faster and more correct). Or am I wrong?
So I tried os.walk:
def SearchFiles2(path):
    count = 0
    for dirpath, subdirs, files in os.walk(path):
        for i in dirpath:
            if not "UselessFolder1" in i and not "UselessFolder1" in i:
                for y in files:
                    fileName = os.path.basename(y)
                    extension = fileName[fileName.rfind("."):]
                    if ".ext2" in extension or ".ext2" in extension or ".ext3" in extension:
                        count += 1
                        # do stuff . . .
    return count
The count is wrong, and it is a bit slower. I also don't think I really understand how os.walk works.
My question is: what can I do to optimize this search?
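Answer 0 (score: 1)
Your first solution is reasonable, except that you could use os.path.splitext to get the extension. Your second solution is incorrect because it reprocesses the files list many times per directory instead of just once: `for i in dirpath` iterates over the characters of the path string, not over folders. With os.walk, the trick is that directories you remove from subdirs (in place) are not part of the next round of enumeration.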
import os

def SearchFiles2(path):
    useless_dirs = set(("UselessFolder1", "UselessFolder2"))
    # extensions to skip; use `in` instead of `not in` below to keep only these
    useless_files = set((".ext1", ".ext2"))
    count = 0
    for dirpath, subdirs, files in os.walk(path):
        # remove unwanted subdirs from future enumeration
        for name in set(subdirs) & useless_dirs:
            subdirs.remove(name)
        # list of interesting files
        myfiles = [os.path.join(dirpath, name) for name in files
                   if os.path.splitext(name)[1] not in useless_files]
        count += len(myfiles)
        for filepath in myfiles:
            # example shows file stats
            print(filepath, os.stat(filepath))
    return count
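To see why the removal has to happen in place, here is a minimal sketch (not part of the original answer) that builds a throwaway directory tree and walks it twice. The helper name `walked_dirs` and the folder names `keep`, `inner`, and `hidden` are made up for illustration.

```python
import os
import shutil
import tempfile


def walked_dirs(root, skip, in_place):
    """Return the sorted base names of the directories os.walk visits below root."""
    seen = []
    for dirpath, subdirs, files in os.walk(root):
        if dirpath != root:
            seen.append(os.path.basename(dirpath))
        if in_place:
            # slice assignment edits the very list object os.walk is holding
            subdirs[:] = [d for d in subdirs if d not in skip]
        else:
            # rebinding the name creates a new list that os.walk never sees
            subdirs = [d for d in subdirs if d not in skip]
    return sorted(seen)


# hypothetical layout: root/keep/inner and root/UselessFolder1/hidden
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "keep", "inner"))
os.makedirs(os.path.join(root, "UselessFolder1", "hidden"))

print(walked_dirs(root, set(["UselessFolder1"]), in_place=True))
print(walked_dirs(root, set(["UselessFolder1"]), in_place=False))
shutil.rmtree(root)
```

With in-place pruning the walk never enters `UselessFolder1`; with plain rebinding the folder is still yielded on the first pass and its contents are still walked.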
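Enumerating the file system of a single storage unit can only go so fast. The best way to speed this up is to run the enumeration of the different storage units in different threads.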
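A rough sketch of that idea, assuming each root passed in lives on a separate storage unit. The names `count_files`, `count_files_parallel`, `TARGET_EXTS`, and `SKIP_DIRS` are illustrative and not from the original posts; only the standard `threading` module is used so it also runs on Python 2.7.

```python
import os
import threading

TARGET_EXTS = set((".ext1", ".ext2", ".ext3"))          # hypothetical extensions
SKIP_DIRS = set(("UselessFolder1", "UselessFolder2"))   # hypothetical folder names


def count_files(root):
    """Count files under root whose extension is in TARGET_EXTS."""
    count = 0
    for dirpath, subdirs, files in os.walk(root):
        # prune in place so os.walk never descends into skipped folders
        subdirs[:] = [d for d in subdirs if d not in SKIP_DIRS]
        for name in files:
            if os.path.splitext(name)[1] in TARGET_EXTS:
                count += 1
    return count


def count_files_parallel(roots):
    """Walk each root in its own thread and return a {root: count} dict."""
    results = {}

    def worker(root):
        # each thread writes only its own key
        results[root] = count_files(root)

    threads = [threading.Thread(target=worker, args=(root,)) for root in roots]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

It could be called as `count_files_parallel(["/Volumes/StorageA", "/Volumes/StorageB"])`, where both paths are placeholders. Any speed-up depends on the units being independent devices; several threads walking the same disk are unlikely to help.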
Answer 1 (score: 0)
So, after testing and discussing with tdelaney, I optimized both solutions as follows:
import os

count = 0
target_files = set((".ext1", ".ext2", ".ext3")) # etc
useless_dirs = set(("UselessFolder1", "UselessFolder2")) # etc
# it could be target_dirs, just swap `in` and `not in` in the comparison.

def SearchFiles1(path):
    global count
    pathList = os.listdir(path)
    for content in pathList:
        fullPath = os.path.join(path, content)
        if os.path.isfile(fullPath):
            if os.path.splitext(fullPath)[1] in target_files:
                count += 1
                #do stuff with 'fullPath' . . .
        else :
            if os.path.isdir(fullPath):
                # compare the folder name, not the full path, against useless_dirs
                if content not in useless_dirs:
                    SearchFiles1(fullPath)

def SearchFiles2(path):
    count = 0
    for dirpath, subdirs, files in os.walk(path):
        # prune unwanted folders in place so os.walk does not descend into them
        for name in set(subdirs) & useless_dirs:
            subdirs.remove(name)
        for filename in [name for name in files if os.path.splitext(name)[1] in target_files]:
            count += 1
            fullPath = os.path.join(dirpath, filename)
            #do stuff with 'fullPath' . . .
    return count
Both run fine on Mac / PC with Python 2.7.5. As for speed, the two are exactly even.