Question

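I'm using Python to parse a WordPress site that I downloaded with wget. All of the HTML files are nested in a complex folder structure (thanks to WordPress and its long URLs), such as site_dump/2010/03/11/post-title/index.html.

However, inside the post-title directory there are other directories, for the feed and for the number-based indexes that Google News requires: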

site_dump/2010/03/11/post-title/index.html  # I want this
site_dump/2010/03/11/post-title/feed/index.html  # Not these
site_dump/2010/03/11/post-title/115232/site.com/2010/03/11/post-title/index.html

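I only want to access the index.html files at the 5th nesting level (site_dump/2010/03/11/post-title/index.html), and nothing deeper. Right now I split the root variable on slashes (/) inside the os.walk loop and only process the file if it is 5 levels deep: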

import os

for root, dirs, files in os.walk('site_dump'):
  nested_levels = root.split('/')
  if len(nested_levels) == 5:
    print(nested_levels)  # Eventually do stuff with the file here

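However, this seems inefficient, because os.walk still traverses those very deep folders. Is there a way to limit the depth of os.walk when it traverses the directory tree?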

Answer 1

import os

for root, dirs, files in os.walk('site_dump'):
  nested_levels = root.split('/')
  if len(nested_levels) == 5:
    del dirs[:]  # Empty the list in place so os.walk does not descend further
    # Eventually do stuff with the file here

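del dirs[:] deletes the contents of the list rather than rebinding dirs to a reference to a new list. It is important to modify the list in place here, because os.walk keeps reading from that same list object to decide which subdirectories to visit next.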
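To make the difference between in-place modification and rebinding concrete, here is a minimal sketch. The helper name visited_dirs and the throwaway a/b directory tree are made up for illustration; only os.walk and the standard tempfile module are real APIs.

```python
import os
import tempfile

def visited_dirs(base, prune):
    """Return the directories os.walk visits (relative to base).

    `prune` picks how the top-level dirs list is handled (hypothetical flag).
    """
    seen = []
    for root, dirs, files in os.walk(base):
        seen.append(os.path.relpath(root, base))
        if root == base:
            if prune == "in_place":
                del dirs[:]   # empties the very list os.walk is holding
            elif prune == "rebind":
                dirs = []     # only rebinds the local name; os.walk's list is untouched
    return sorted(seen)

with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, "a", "b"))
    in_place = visited_dirs(base, "in_place")
    rebound = visited_dirs(base, "rebind")

print(in_place)      # only the top directory was visited
print(len(rebound))  # the top directory plus a and a/b
```

With in-place deletion the walk stops at the top directory, while rebinding has no effect and all three directories are visited.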

From the docs, where topdown refers to an optional parameter of os.walk that you omitted and that defaults to True:

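When topdown is True, the caller can modify the dirnames list in-place (perhaps using del or slice assignment), and walk() will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search, impose a specific order of visiting, or even to inform walk() about directories the caller creates or renames before it resumes walk() again. Modifying dirnames when topdown is False has no effect on the behavior of the walk, because in bottom-up mode the directories in dirnames are generated before dirpath itself is generated.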
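The same pruning idea can be wrapped in a reusable generator. This is only a sketch under stated assumptions: walk_max_depth is a hypothetical helper, not part of the os module, and it measures depth from the relative path instead of splitting on '/', so that it does not depend on the platform's path separator.

```python
import os
import tempfile

def walk_max_depth(top, max_depth):
    """Yield (root, dirs, files) like os.walk, descending at most
    max_depth levels below top (hypothetical helper)."""
    for root, dirs, files in os.walk(top):  # topdown=True by default
        rel = os.path.relpath(root, top)
        depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
        if depth >= max_depth:
            del dirs[:]  # prune in place: os.walk will not go any deeper
        yield root, dirs, files

# Build a small throwaway tree that mimics the layout in the question.
with tempfile.TemporaryDirectory() as base:
    post = os.path.join(base, "2010", "03", "11", "post-title")
    os.makedirs(os.path.join(post, "feed"))
    for d in (post, os.path.join(post, "feed")):
        with open(os.path.join(d, "index.html"), "w") as fh:
            fh.write("<html></html>")

    found = [
        os.path.relpath(os.path.join(root, "index.html"), base)
        for root, dirs, files in walk_max_depth(base, 4)
        if "index.html" in files
    ]

print(found)  # the post-level index.html only; feed/ is never entered
```

Because dirs is emptied before the tuple is yielded, callers see an empty dirs list at the deepest level. For the question's layout, walk_max_depth('site_dump', 4) would stop at the post-title directories.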
