I want to check whether a robots.txt file exists at a given URL. I found urllib.robotparser in Python 3 and tried to get the response, but I can't find a way to return the status code for robots.txt.
from urllib import parse
from urllib import robotparser

def get_url_status_code():
    URL_BASE = 'https://google.com/'
    parser = robotparser.RobotFileParser()
    parser.set_url(parse.urljoin(URL_BASE, 'robots.txt'))
    parser.read()
    # I want to return the status code

print(get_url_status_code())
Answer 0 (score: 1)
This isn't hard to do if you can use the highly recommended requests module:
import requests

def status_code(url):
    r = requests.get(url)
    return r.status_code

print(status_code('https://github.com/robots.txt'))
print(status_code('https://doesnotexist.com/robots.txt'))
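Note that if the host can't be resolved at all (as may be the case for the second URL above), requests raises a ConnectionError rather than returning a status code, which is why the exception-handling version further down is the safer choice.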
Otherwise, if you want to avoid a GET request, you can use HEAD instead:
def does_url_exist(url):
    return requests.head(url).status_code < 400
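One caveat: not every server implements HEAD, and some respond to it with 405 Method Not Allowed. A minimal sketch of a GET fallback (the 405 check is my own addition, not part of the original answer):

def does_url_exist(url):
    # Some servers reject HEAD with 405; retry with GET in that case
    r = requests.head(url)
    if r.status_code == 405:
        r = requests.get(url)
    return r.status_code < 400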
Better still:
def does_url_exist(url):
    try:
        r = requests.head(url)
        # Any status below 400 means the URL exists
        return r.status_code < 400
    except requests.exceptions.RequestException as e:
        print(e)
        # handle your exception
        return False
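If you still want the parsed robots.txt rules and not just an existence check, one option is to fetch the file with requests and feed the body to RobotFileParser.parse(), which keeps the status code visible. A minimal sketch along those lines (fetch_robots is a hypothetical helper name):

import requests
from urllib import robotparser

def fetch_robots(url):
    # Fetch robots.txt ourselves so we can inspect the status code,
    # then hand the body to the stdlib parser
    r = requests.get(url)
    if r.status_code != 200:
        return None
    parser = robotparser.RobotFileParser()
    parser.parse(r.text.splitlines())
    return parser

parser = fetch_robots('https://github.com/robots.txt')
if parser is not None:
    print(parser.can_fetch('*', 'https://github.com/search'))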