I ran this BeautifulSoup code against a website:
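from bs4 import BeautifulSoup
from requests import get

# Send a browser-like User-Agent header with the request
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
funda = "https://www.funda.nl/koop/amsterdam/"
response = get(funda, headers=headers)
print(response)  # HTTP status of the response
html_soup = BeautifulSoup(response.text, 'html.parser')
print(response.text)  # raw HTML body returned by the server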
And I got this response:
<Response [200]>
<!DOCTYPE html>
<html>
<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="refresh" content="10; url=/distil_r_captcha.html?requestId=01fe7635-8c6e-404f-b905-fd8d854fa40c&httpReferrer=%2Fkoop%2Famsterdam%2F" />
<script type="text/javascript">
(function(window){
try {
if (typeof sessionStorage !== 'undefined'){
sessionStorage.setItem('distil_referrer', document.referrer);
}
} catch (e){}
})(window);
</script>
<script type="text/javascript" src="/fundadst.rvezxdcvwbzdewcsbar.js" defer></script><style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#suuazwruefzeaa{display:none!important}</style></head>
<body>
<div id="distilIdentificationBlock"> </div>
</body>
</html>
Am I being blocked? Is this block permanent, and is there anything I can do about it?
Thanks
Answer 0 (score: 0)
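It looks like you are trying to scrape a JavaScript-rendered website with the Python requests library. requests only fetches the raw HTML the server returns and does not execute JavaScript, which is why the response contains a script block instead of the listings you expected.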
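Before changing tools, it can help to confirm in code that the response is a challenge page rather than real content. The following is a minimal sketch using only the standard library. The two markers it looks for (a `distilIdentificationBlock` element and a meta refresh pointing at a captcha URL) are taken from the response shown in the question; whether other pages on the site use the same markers is an assumption, and the names `ChallengeDetector` and `looks_like_challenge` are made up for illustration.

```python
from html.parser import HTMLParser


class ChallengeDetector(HTMLParser):
    """Collects signs that a page is a bot-challenge interstitial."""

    def __init__(self):
        super().__init__()
        self.refresh_content = ""      # content of <meta http-equiv="refresh">
        self.has_distil_block = False  # saw id="distilIdentificationBlock"

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        http_equiv = (attrs.get("http-equiv") or "").lower()
        if tag == "meta" and http_equiv == "refresh":
            # e.g. "10; url=/distil_r_captcha.html?requestId=..."
            self.refresh_content = attrs.get("content") or ""
        if attrs.get("id") == "distilIdentificationBlock":
            self.has_distil_block = True


def looks_like_challenge(html):
    """Return True if the HTML carries the markers seen in the question."""
    detector = ChallengeDetector()
    detector.feed(html)
    redirects_to_captcha = "captcha" in detector.refresh_content.lower()
    return detector.has_distil_block or redirects_to_captcha
```

With the code from the question, `looks_like_challenge(response.text)` could be checked before handing the text to BeautifulSoup, so that an empty parse result is not mistaken for an empty listing page.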
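You should consider switching to one of the following packages:
There are other libraries that wrap the Chrome driver, but few of them are actively maintained.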
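As an illustration of the browser-driver approach, here is a rough sketch using Selenium, which is one widely used library that drives Chrome. It is not necessarily one of the packages this answer had in mind. The sketch assumes the `selenium` package (version 4 or later) and a local Chrome installation; the function name `fetch_rendered_html` and the fixed `wait_seconds` delay are made up for illustration.

```python
def fetch_rendered_html(url, wait_seconds=5):
    """Load url in headless Chrome and return the DOM after scripts have run."""
    # Imported inside the function so this module still loads
    # on machines where Selenium is not installed.
    import time
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # crude wait for client-side scripts to finish
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

The returned string can be passed to `BeautifulSoup(html, 'html.parser')` in place of `response.text`. A headless browser may still be served the same challenge page, so the result is worth inspecting before parsing.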
Here is an intermediate-level tutorial on web scraping with Scrapy: https://medium.com/@djerahahmedrafik/web-scraping-fundamentals-using-scrapy-84a1e64b5ec
Hope this helps.