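I have a directory structure in which a path might look like this:
/mnt/type/split/v2/doc/RESOURCE_ID/YYYY/FY/DOCUMENT_ID
For example, in
/mnt/type/split/v2/doc/100045/2008/FY/28
the components are
RESOURCE_ID = 100045
YYYY = 2008
DOCUMENT_ID = 28
Note that DOCUMENT_ID is the last directory in the path; the files live inside the DOCUMENT_ID directory.
I am trying to inventory this structure, but I get five copies of each path in my magic_paths list. I have 1,500,000 paths, so my list ends up with 7,500,000 items.
The first 1,500,000 are unique values. The remaining 6,000,000 consist of groups rooted at a RESOURCE_ID, each repeated 4 times. This is the code I am using: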
def survey():
    magic_paths = []
    for (resource_id, dirname, filename) in os.walk('/mnt/type/split/v2/doc'):
        if resource_id:
            for (magic_path, dirname2, filename2) in os.walk(resource_id):
                if len(magic_path.split(os.sep)) == 10:
                    magic_paths.append(magic_path + os.linesep)
    write_survey(magic_paths)
    x = len(magic_paths)
    return x
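There are various files in the directories and subdirectories at every level; I only need to inventory the paths to the DOCUMENT_ID directories.
I don't understand why the results are patterned this way. I believed I was starting at each RESOURCE_ID and keeping only directories 9 levels deep, because splitting such a path on os.sep gives me a list of 10 items (the leading separator produces an empty first item). A sample of the output: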
/mnt/type/split/v2/doc/100045/2008/FY/28 #obs_1
/mnt/type/split/v2/doc/100045/2008/FY/29 #obs_2
/mnt/type/split/v2/doc/100045/2008/FY/30 #obs_3
/mnt/type/split/v2/doc/100045/2008/FY/31 #obs_4
/mnt/type/split/v2/doc/1028/2008/FY/28 #obs_5 # see the new RESOURCE_ID
.
. 1,499,995 more unique values
.
/mnt/type/split/v2/doc/100045/2008/FY/28 #begin of first repetition
/mnt/type/split/v2/doc/100045/2008/FY/29
/mnt/type/split/v2/doc/100045/2008/FY/30
/mnt/type/split/v2/doc/100045/2008/FY/31
/mnt/type/split/v2/doc/100045/2008/FY/28 #begin of second repetition
/mnt/type/split/v2/doc/100045/2008/FY/29
/mnt/type/split/v2/doc/100045/2008/FY/30
/mnt/type/split/v2/doc/100045/2008/FY/31
/mnt/type/split/v2/doc/100045/2008/FY/28 #begin of third repetition
/mnt/type/split/v2/doc/100045/2008/FY/29
/mnt/type/split/v2/doc/100045/2008/FY/30
/mnt/type/split/v2/doc/100045/2008/FY/31
/mnt/type/split/v2/doc/100045/2008/FY/28 #begin of fourth repetition
/mnt/type/split/v2/doc/100045/2008/FY/29
/mnt/type/split/v2/doc/100045/2008/FY/30
/mnt/type/split/v2/doc/100045/2008/FY/31
/mnt/type/split/v2/doc/1028/2008/FY/28 #series of 4 repetitions based on RESOURCE ID 1028
(Edited in response to a question in the comments.)
Answer 0 (score: 1)
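os.walk() traverses a directory structure recursively. For every directory the outer walk yields, you start another recursive walk, so each directory and everything nested beneath it gets traversed again. Because searches start at /mnt/type/split/v2/doc, /mnt/type/split/v2/doc/100045, /mnt/type/split/v2/doc/100045/2008, /mnt/type/split/v2/doc/100045/2008/FY and /mnt/type/split/v2/doc/100045/2008/FY/28, you produce 5 matches for each document ID. (The if resource_id: test never filters anything, since a path string is always truthy.)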
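To see the duplication on a small scale, here is a minimal sketch that rebuilds the question's nested loops against a throwaway tree. The helper names nested_walk and build_tree, the temporary directory, and the two sample document IDs are assumptions made for illustration; the depth is computed relative to the temporary root rather than hard-coded to 10.

```python
import os
import tempfile
from collections import Counter


def nested_walk(root, depth):
    """Reproduce the question's nested os.walk() loops (the buggy pattern)."""
    found = []
    for outer_dir, _dirs, _files in os.walk(root):
        # Every directory the outer walk yields becomes the start of a new walk.
        for inner_dir, _dirs2, _files2 in os.walk(outer_dir):
            if len(inner_dir.split(os.sep)) == depth:
                found.append(inner_dir)
    return found


def build_tree(root):
    """Create a tiny hypothetical tree: root/RESOURCE_ID/YYYY/FY/DOCUMENT_ID."""
    for doc_id in ("28", "29"):
        os.makedirs(os.path.join(root, "100045", "2008", "FY", doc_id))


with tempfile.TemporaryDirectory() as root:
    build_tree(root)
    # DOCUMENT_ID directories sit four levels below the root.
    depth = len(root.split(os.sep)) + 4
    counts = Counter(nested_walk(root, depth))
    for path, n in sorted(counts.items()):
        print(os.path.relpath(path, root), n)
```

Each document directory is reported five times: once for the inner walk started at the root, and once each for the walks started at the RESOURCE_ID, YYYY, FY and DOCUMENT_ID directories.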
Just call os.walk() once:
def survey():
    magic_paths = []
    for (resource_id, dirnames, filenames) in os.walk('/mnt/type/split/v2/doc'):
        if len(resource_id.split(os.sep)) == 10:
            magic_paths.append(resource_id + os.linesep)
    write_survey(magic_paths)
    x = len(magic_paths)
    return x
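You may want to prune the search once you have a match; after the DOCUMENT_ID directory is found, there is no point descending into its subdirectories: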
def survey():
    magic_paths = []
    for (resource_id, dirnames, filenames) in os.walk('/mnt/type/split/v2/doc'):
        if len(resource_id.split(os.sep)) == 10:
            magic_paths.append(resource_id + os.linesep)
            dirnames[:] = []  # clear the subdirs list to stop further recursion here
    write_survey(magic_paths)
    x = len(magic_paths)
    return x
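The pruned single walk can be tried without the real /mnt tree. The following is only a sketch under stated assumptions: find_document_dirs, its levels_below_root parameter, and the extra "attachments" level are hypothetical names introduced here, and the depth is measured relative to the root instead of being fixed at 10 path components. Pruning relies on os.walk() running top-down, which is its default.

```python
import os
import tempfile


def find_document_dirs(root, levels_below_root=4):
    """Collect directories exactly `levels_below_root` levels under `root`."""
    root = os.path.abspath(root)
    target_depth = root.count(os.sep) + levels_below_root
    matches = []
    for current, dirnames, _filenames in os.walk(root):
        if current.count(os.sep) == target_depth:
            matches.append(current)
            # Emptying the list in place stops os.walk() from descending further.
            dirnames[:] = []
    return matches


with tempfile.TemporaryDirectory() as tmp:
    for doc_id in ("28", "29"):
        # The "attachments" level is hypothetical; it shows that pruning skips it.
        os.makedirs(os.path.join(tmp, "100045", "2008", "FY", doc_id, "attachments"))
    found = find_document_dirs(tmp)
    print(sorted(os.path.relpath(p, tmp) for p in found))
```

Each DOCUMENT_ID directory shows up exactly once, and the nested "attachments" directories are never visited. Measuring depth relative to the root keeps the function usable if the tree is mounted somewhere other than /mnt/type/split/v2/doc.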