Question

I'm new to web scraping, but this is what I've managed to put together so far:

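from urllib.request import Request, urlopen

# Send a browser-like User-Agent; without it the server answered with a 405.
page = urlopen(
    Request(
        'https://www.example.com',
        data=None,
        headers={
            'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'
        }
    )
)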

print(page.read())

After setting the User-Agent I no longer get a 405 error, but I don't get the actual web page either, only metadata:

b'<!DOCTYPEhtml>\n<html>\n\n\t\n\n\t\n\t\n\t\n\n\t\n\t\n\n\t\n\t\n\t\n\n<head>\n<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">\n<meta http-equiv="cache-control" content="max-age=0" />\n<meta http-equiv="cache-control" content="no-cache" />\n<meta http-equiv="expires" content="0" />\n<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />\n<meta http-equiv="pragma" content="no-cache" />\n<meta http-equiv="refresh" content="10; url=/distil_r_captcha.html?Ref=/&distil_RID=E302A72E-F80D-11E7-BB05-F067B31C80C1&distil_TID=20180113030018" />\n<script type="text/javascript">\n\t(function(window){\n\t\ttry {\n\t\t\tif (typeof sessionStorage !== \'undefined\'){\n\t\t\t\tsessionStorage.setItem(\'distil_referrer\', document.referrer);\n\t\t\t}\n\t\t} catch (e){}\n\t})(window);\n</script>\n<script type="text/javascript" src="/cndnrlsttdstl.js" defer></script><style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#xwdaqqdutzds{display:none!important}</style></head>\n<body>\n<div id="distil_ident_block"> </div>\n</body>\n</html>\n'
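
The body above contains a `<meta http-equiv="refresh">` tag pointing at `/distil_r_captcha.html` and an otherwise empty `<body>`, which suggests it is an interstitial page rather than the content seen in the browser. A minimal sketch of how such a response could be detected with the standard library follows. The names `MetaRefreshFinder`, `find_meta_refresh`, and `sample` are hypothetical, and `sample` is a shortened stand-in for the real response, not the full output.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Collects the content attribute of every <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.refresh_targets = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("http-equiv", "").lower() == "refresh":
            self.refresh_targets.append(attr_map.get("content", ""))


def find_meta_refresh(body: bytes) -> list:
    """Return the refresh directives found in a raw response body."""
    finder = MetaRefreshFinder()
    finder.feed(body.decode("utf-8", errors="replace"))
    return finder.refresh_targets


# Shortened, hypothetical stand-in for the bytes returned by page.read().
sample = (
    b'<html><head>'
    b'<meta http-equiv="expires" content="0" />'
    b'<meta http-equiv="refresh" content="10; url=/distil_r_captcha.html?Ref=/" />'
    b'</head><body><div id="distil_ident_block"> </div></body></html>'
)

targets = find_meta_refresh(sample)
print(targets)
```

A non-empty result means the server sent a redirect page instead of the requested document, so `page.read()` is returning exactly what was served and the difference from the browser view comes from what happens after that page loads.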

Any idea how to extract the actual HTML that I see when I inspect the page?

Trying to read a web page with urllib and getting only metadata

0 answers: