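I have recently been trying to scrape an e-commerce website. At first I kept getting redirected to an "Are you a robot?" page. I then started using a browser user agent, scrapy-splash for JavaScript, and a 5-second download delay. Now there are no errors, but the correct page is not rendered.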
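For reference, a settings.py matching that description might look roughly like the sketch below. The Splash address is the one that appears in the log output; the user-agent string is only a placeholder, and the middleware entries and priorities follow the scrapy-splash README as best I recall it, so treat them as assumptions to check against the installed version.

```python
# settings.py (sketch, not the asker's actual file)

# Placeholder browser user agent; any current browser string would do.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
)

# Wait 5 seconds between requests, as described in the question.
DOWNLOAD_DELAY = 5

# Splash instance seen in the log output.
SPLASH_URL = "http://192.168.99.100:8050"

# scrapy-splash wiring (assumed from the project's README).
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```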
spider.py
import scrapy
from scrapy_splash import SplashRequest

# Product is the project's Item class; its import is not shown in the question.


class ClassifiedsSpider(scrapy.Spider):
    name = 'classifieds'
    allowed_domains = ['dubai.dubizzle.com']
    start_urls = ['http://dubai.dubizzle.com/classified/']

    def parse(self, response):
        # Follow the first ad link on the listing page, rendering it through Splash.
        url = response.xpath("//h3[@id='title']/span[@class='title']/a/@href").extract_first()
        yield SplashRequest(url, self.get_details,
            endpoint='render.html',
            args={'wait': 0.5, 'proxy': 'http://89.212.66.36:8080'},
        )

    def get_details(self, response):
        title = response.xpath("//h1[@id='title']/span[@id='listing-title-wrap']").extract_first()
        item = Product()
        item['title'] = title
        yield item
        # Dump the rendered page for inspection.
        with open("body.txt", "a") as f:
            f.write(response.body.decode("utf-8"))
Rendered HTML
<!DOCTYPE html><html><head>
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">
<meta http-equiv="cache-control" content="max-age=0">
<meta http-equiv="cache-control" content="no-cache">
<meta http-equiv="expires" content="0">
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT">
<meta http-equiv="pragma" content="no-cache">
<meta http-equiv="refresh" content="10; url=/distil_r_captcha.html?requestId=69660faf-3162-4101-9087-7d1dd8930125&httpReferrer=%2Fclassified%2Ffurniture-home-garden%2Ffurniture%2Fsofas-futons-lounges%2F2018%2F7%2F25%2Fpure-leather-sofa-2%2F%3Fback%3DL2NsYXNzaWZpZWQv%26pos%3D1">
<script type="text/javascript">
(function(window){
try {
if (typeof sessionStorage !== 'undefined'){
sessionStorage.setItem('distil_referrer', document.referrer);
}
} catch (e){}
})(window);
</script>
<script type="text/javascript" src="/dstldbzzlxhr.js" defer=""></script><style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#yuatbtfwcdza{display:none!important}</style></head>
<body>
<div id="distilIdentificationBlock"> </div>
<div id="d__fFH" style="position: absolute; top: -5000px; left: -5000px;"><object id="d_dlg" classid="clsid:3050f819-98b5-11cf-bb82-00aa00bdce0b" width="0px" height="0px"></object><span id="d__fF" style="font-family: Courier, serif; font-size: 200px; visibility: hidden;">The quick brown fox jumps over the lazy dog.</span></div></body></html>
Output:
2018-07-25 17:19:58 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://dubai.dubizzle.com/classified/> from <GET http://dubai.dubizzle.com/classified/>
2018-07-25 17:20:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dubai.dubizzle.com/classified/> (referer: None)
2018-07-25 17:20:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dubai.dubizzle.com/classified/clothing-accessories/mens-accessories/sunglasses/2018/7/25/oakley-juliet-gun-metal-gray-with-addition-2/?back=L2NsYXNzaWZpZWQv&pos=1 via http://192.168.99.100:8050/render.html> (referer: None)
2018-07-25 17:20:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://dubai.dubizzle.com/classified/clothing-accessories/mens-accessories/sunglasses/2018/7/25/oakley-juliet-gun-metal-gray-with-addition-2/?back=L2NsYXNzaWZpZWQv&pos=1>
{'title': None}
2018-07-25 17:20:12 [scrapy.core.engine] INFO: Closing spider (finished)
2018-07-25 17:20:12 [scrapy.extensions.feedexport] INFO: Stored csv feed (1 items) in: data.csv
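I realize the site is somehow detecting that this is a web crawler, but I can't seem to work around it.
Note: this does not happen on the classifieds listing page, which I can scrape easily to get the URL of each ad. The problem appears when I request the individual ad URLs.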
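Because the block page comes back with status 200, the log alone does not show that anything went wrong. One way to make the failure visible is to test the rendered body for the markers seen in the HTML above. The sketch below assumes those two markers (the meta refresh to distil_r_captcha.html and the distilIdentificationBlock div) are representative; other block pages may look different, and the function name is made up for illustration.

```python
import re

# Meta refresh pointing at the captcha page, as in the rendered HTML above.
_CAPTCHA_REFRESH = re.compile(
    r'<meta[^>]+http-equiv=["\']refresh["\'][^>]+distil_r_captcha',
    re.IGNORECASE,
)


def looks_like_distil_block(html: str) -> bool:
    """Return True if the page resembles the block page shown in the question."""
    if _CAPTCHA_REFRESH.search(html):
        return True
    # Fallback marker: the identification div present in the block page body.
    return 'id="distilIdentificationBlock"' in html
```

In get_details this could be called on response.text before extracting the title, logging a warning instead of yielding an item whose title is None.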
Answer 0 (score: 0)
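Most websites these days have a robots.txt file to keep scraping bots out. To get past this, open your settings file and set the ROBOTSTXT_OBEY flag to False.
Or, if you want to check this in the shell, you can pass the setting on the command line with -s ROBOTSTXT_OBEY=False.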
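In settings.py the change is the single line ROBOTSTXT_OBEY = False. That setting only decides whether Scrapy skips URLs that robots.txt disallows, so it can be worth checking whether robots.txt covers the ad URLs at all. Here is a rough sketch using the standard library's urllib.robotparser; the robots.txt body, the bot name, and the example.com URLs are hypothetical, not the real dubizzle file.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in ("https://example.com/private/page",
                "https://example.com/classified/"):
        print(url, allowed_by_robots(SAMPLE_ROBOTS, "mybot", url))
```

If the real file already allows the ad URLs, switching off ROBOTSTXT_OBEY would not be expected to change what the site returns.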