Question

I ran this BeautifulSoup code against a website:

from requests import get
from bs4 import BeautifulSoup

# Send a browser-like User-Agent with the request
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
funda = "https://www.funda.nl/koop/amsterdam/"
response = get(funda, headers=headers)
print(response)  # prints the status, e.g. <Response [200]>
html_soup = BeautifulSoup(response.text, 'html.parser')

print(response.text)

and got this response:

<Response [200]>
<!DOCTYPE html>
<html>

<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="refresh" content="10; url=/distil_r_captcha.html?requestId=01fe7635-8c6e-404f-b905-fd8d854fa40c&httpReferrer=%2Fkoop%2Famsterdam%2F" />
<script type="text/javascript">
    (function(window){
        try {
            if (typeof sessionStorage !== 'undefined'){
                sessionStorage.setItem('distil_referrer', document.referrer);
            }
        } catch (e){}
    })(window);
</script>
<script type="text/javascript" src="/fundadst.rvezxdcvwbzdewcsbar.js" defer></script><style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#suuazwruefzeaa{display:none!important}</style></head>
<body>
<div id="distilIdentificationBlock">&nbsp;</div>
</body>
</html>

Am I being blocked? Is the block permanent, and is there anything I can do about it?

Thanks

Answer 1

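It looks like you are trying to scrape a JavaScript-rendered website with the Python requests library. requests does not execute JavaScript; it only returns the HTML the server sends, which is why the response contains script blocks instead of the page content.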
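On the "am I blocked?" part: the 200 status by itself says little, because the body in the question is a placeholder page with a meta refresh to a captcha URL and a `distilIdentificationBlock` div. A minimal sketch for spotting that kind of page follows. The helper names `ChallengeDetector` and `challenge_signals` are made up for illustration, and the two markers are simply the ones visible in the response above; other sites will likely use different ones.

```python
from html.parser import HTMLParser


class ChallengeDetector(HTMLParser):
    """Collects signs that a page is a bot challenge rather than real content."""

    def __init__(self):
        super().__init__()
        self.signals = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values may be None.
        attrs = {name: (value or "") for name, value in attrs}
        # A meta refresh that redirects to a captcha URL, as in the response above.
        if (tag == "meta"
                and attrs.get("http-equiv", "").lower() == "refresh"
                and "captcha" in attrs.get("content", "").lower()):
            self.signals.append("meta refresh to captcha page")
        # The placeholder div that appears in the response above (assumed marker).
        if attrs.get("id") == "distilIdentificationBlock":
            self.signals.append("distilIdentificationBlock element")


def challenge_signals(html):
    """Return the challenge-page markers found in the HTML (empty list if none)."""
    detector = ChallengeDetector()
    detector.feed(html)
    return detector.signals
```

Calling `challenge_signals(response.text)` right after the request and checking for a non-empty list is one way to tell this placeholder apart from a real listing page before parsing any further.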

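You should consider switching to one of the following packages:

Selenium (drives a real browser, which can run headless)
Scrapy (crawls with spiders; note that it does not run JavaScript on its own)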
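For the Selenium option, a rough sketch of fetching the rendered HTML could look like this. It assumes Selenium 4 and a local Chrome install; `fetch_rendered_html` and its optional `driver` parameter are hypothetical names for illustration. There is no guarantee that a headless browser gets past the site's bot detection.

```python
def fetch_rendered_html(url, driver=None):
    """Load url in a browser and return the HTML after JavaScript has run.

    If no driver is passed in, a headless Chrome is started and shut down here.
    """
    own_driver = driver is None
    if own_driver:
        # Imported lazily so that a caller-supplied driver needs no Selenium install.
        from selenium import webdriver
        from selenium.webdriver.chrome.options import Options

        options = Options()
        # "--headless=new" targets recent Chrome; older versions use "--headless".
        options.add_argument("--headless=new")
        driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Only close a browser that this function opened itself.
        if own_driver:
            driver.quit()
```

The returned string can be handed to `BeautifulSoup(html, 'html.parser')` in the same way as `response.text` in the question.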

There are a few other libraries that wrap the Chrome driver, but not many of them are maintained.

Here is a Medium tutorial on scraping with Selenium: https://medium.com/@hoppy/how-to-test-or-scrape-javascript-rendered-websites-with-python-selenium-a-beginner-step-by-c137892216aa

And here is a Medium tutorial on scraping with Scrapy: https://medium.com/@djerahahmedrafik/web-scraping-fundamentals-using-scrapy-84a1e64b5ec

Hope this helps.
