Question

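I want to check whether a given website has a robots.txt file, read the entire contents of that file, and print them. Adding the contents to a dictionary would also be nice.

I tried playing with the robotparser module, but could not figure out how to do it.

I only want to use modules that ship with the standard Python 2.7 distribution.

I did what @Stefano Sanfilippo suggested: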

from urllib.request import urlopen

which returned:

Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    from urllib.request import urlopen
ImportError: No module named request

So I tried:

import urllib2
from urllib2 import Request
from urllib2 import urlopen
with urlopen("https://www.google.com/robots.txt") as stream:
    print(stream.read().decode("utf-8"))

but got:

Traceback (most recent call last):
  File "", line 1, in <module>
    with urlopen("https://www.google.com/robots.txt") as stream:
AttributeError: addinfourl instance has no attribute '__exit__'

According to bugs.python.org, this does not seem to be supported in 2.7. In fact, the code works fine in Python 3. Any idea how to solve this?

Answer 1

Yes, robots.txt is just a file: download it and print it!

Python 3:

from urllib.request import urlopen

with urlopen("https://www.google.com/robots.txt") as stream:
    print(stream.read().decode("utf-8"))

Python 2:

from urllib import urlopen
from contextlib import closing

with closing(urlopen("https://www.google.com/robots.txt")) as stream:
    print stream.read()
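
The Python 2 version works because contextlib.closing wraps any object that has a close() method in a context manager, supplying the __exit__ that the addinfourl object lacks. A minimal sketch of that behaviour, using a stand-in class (FakeResponse is a hypothetical name for illustration only, not part of urllib):

```python
from contextlib import closing


class FakeResponse(object):
    """Hypothetical stand-in: has read() and close(), but no __enter__/__exit__."""

    def __init__(self, body):
        self.body = body
        self.closed = False

    def read(self):
        return self.body

    def close(self):
        self.closed = True


response = FakeResponse("User-agent: *\nDisallow: /search\n")

# closing() provides the context-manager protocol and calls close() on exit.
with closing(response) as stream:
    content = stream.read()

print(content)
print(response.closed)
```

After the with block ends, response.closed is True, which is the same cleanup the real response object receives.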

Note that the path is always /robots.txt.
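
Since the path is fixed, the robots.txt URL can be derived from any page URL on the same site, and an HTTP error such as 404 can be treated as "no robots.txt". The following is a rough sketch under those assumptions; the helper names robots_url and fetch_robots are made up for this example, the import fallback is there so the same code runs on Python 2.7 and Python 3, and the file is assumed to be UTF-8 encoded:

```python
try:
    # Python 3
    from urllib.parse import urljoin
    from urllib.request import urlopen
    from urllib.error import HTTPError
except ImportError:
    # Python 2.7
    from urlparse import urljoin
    from urllib2 import urlopen, HTTPError
from contextlib import closing


def robots_url(page_url):
    """Return the robots.txt URL for the site that serves page_url."""
    # Joining with an absolute path replaces the path, query and fragment.
    return urljoin(page_url, "/robots.txt")


def fetch_robots(page_url):
    """Return the robots.txt text, or None if the server replies with an HTTP error."""
    try:
        with closing(urlopen(robots_url(page_url))) as stream:
            # Assumption: the file is UTF-8 encoded.
            return stream.read().decode("utf-8")
    except HTTPError:
        return None


print(robots_url("https://www.google.com/search?q=python"))
# prints https://www.google.com/robots.txt
```

Calling fetch_robots("https://www.google.com/") would then perform the actual download. Network failures other than HTTP errors (for example an unresolvable host) are not caught in this sketch and will propagate to the caller.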

If you need to put the contents into a dictionary, .split(":") and .strip() are your friends.
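
One possible way to do that is sketched below. The function name parse_robots and the sample text are invented for illustration, and the sketch assumes a flat mapping from directive name to a list of values is enough:

```python
def parse_robots(text):
    """Group robots.txt values by directive name, e.g. {"Disallow": ["/search"]}."""
    rules = {}
    for line in text.splitlines():
        # Drop trailing comments, then surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if ":" not in line:
            continue  # blank or malformed line
        # Split on the first colon only: values such as URLs contain colons.
        key, value = line.split(":", 1)
        rules.setdefault(key.strip(), []).append(value.strip())
    return rules


# Made-up sample content, not fetched from a real site.
sample = """\
User-agent: *
Disallow: /search   # keep crawlers out of search results
Allow: /search/about
Sitemap: https://example.com/sitemap.xml
"""

rules = parse_robots(sample)
print(rules["Disallow"])  # ['/search']
print(rules["Sitemap"])   # ['https://example.com/sitemap.xml']
```

Note that this flat dictionary loses the grouping of rules under each User-agent line; if that grouping matters, the parsing would need to track the current user agent as it walks the lines.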

