从URL中找出网站结构

时间:2013-06-25 20:05:03

标签: python tree

我正在为网站构建信息架构。目标是刮取URL并使用python脚本基于url构建不同子目录的关系树。我能够收集网址,但我如何解析网址并找出网站层次结构。

Raw data:
http://www.ex.com
http://www.ex.com/product_cat_1/
http://www.ex.com/product_cat_1/item_1
http://www.ex.com/product_cat_1/item_2
http://www.ex.com/product_cat_2/
http://www.ex.com/product_cat_2/item_1
http://www.ex.com/product_cat_2/item_2
http://www.ex.com/terms_and_conditions/
http://www.ex.com/Media_Center



Example output:
{'url':'http://www.ex.com', 'sub_dir':[
{'url':'http://www.ex.com/product_cat_1/', 'sub_dir':[
                        {'url':'http://www.ex.com/product_cat_1/item_1'}, {'url':'http://www.ex.com/product_cat_1/item_2'}]},
{'url':'http://www.ex.com/product_cat_2/', 'sub_dir':[
                        {'url':'http://www.ex.com/product_cat_2/item_1'},
                        'url':'http://www.ex.com/product_cat_2/item_2']},
{'url':'http://www.ex.com/terms_and_conditions/'},
{'url':'http://www.ex.com/Media_Center'},
]}

0 个答案:

没有答案