Parsing URLs from a txt file

Time: 2013-08-27 23:03:32

Tags: python parsing robots.txt

I am trying to parse a txt file that looks like this:

Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html

I need to read the file and extract the URL portion after 'Disallow', while ignoring the comments. Thanks in advance.

2 answers:

Answer 0 (score: 5)

If you are parsing a robots.txt file, you should use the robotparser module (renamed urllib.robotparser in Python 3):

>>> import robotparser

>>> r = robotparser.RobotFileParser()
>>> r.set_url("http://www.your_url.com/robots.txt")
>>> r.read()

Then check:

>>> r.can_fetch("*", "/foo.html")
False
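In Python 3 the same parser lives in urllib.robotparser, and its parse() method accepts the rule lines directly, so no network fetch is needed. A minimal sketch using the sample rules from the question (note: a "User-agent: *" line is added here as an assumption, because the parser only applies Disallow rules that follow a User-agent line):

```python
from urllib import robotparser

# the sample rules from the question, with an assumed User-agent line prepended
rules = [
    "User-agent: *",
    "Disallow: /cyberworld/map/ # This is an infinite virtual URL space",
    "Disallow: /tmp/ # these will soon disappear",
    "Disallow: /foo.html",
]

r = robotparser.RobotFileParser()
r.parse(rules)  # feed the lines directly instead of calling set_url()/read()

print(r.can_fetch("*", "/foo.html"))      # False: matches a Disallow rule
print(r.can_fetch("*", "/allowed.html"))  # True: no rule matches this path
```

The parser strips the inline # comments itself, so they do not need to be removed beforehand.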

Answer 1 (score: 1)

Assuming there is no # in the URLs:

with open('path/to/file') as infile:
    # str.lstrip removes a set of characters, not a prefix, so slice off "Disallow:" instead
    URLs = [line.strip()[len("Disallow:"):].split("#", 1)[0].strip() for line in infile]

To allow # in URLs, assuming comments start with a # that is separated from the URL by a space:

with open('path/to/file') as infile:
    # split on " #" (space then #) so a # embedded in a URL is kept
    URLs = [line.strip()[len("Disallow:"):].split(" #", 1)[0].strip() for line in infile]
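An alternative sketch using the re module, which only matches lines that actually start with "Disallow:" (so blank lines or other directives are skipped) and drops everything from the first # onward; the sample string below is the file content from the question:

```python
import re

# sample file content taken from the question
sample = """\
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html
"""

# capture everything after "Disallow:" up to (but not including) a # comment
pattern = re.compile(r"^Disallow:\s*([^#]*)")

urls = [m.group(1).strip()
        for line in sample.splitlines()
        if (m := pattern.match(line))]

print(urls)  # ['/cyberworld/map/', '/tmp/', '/foo.html']
```

The walrus operator (:=) requires Python 3.8+; on older versions, the same filter-and-capture can be written as a small for loop.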