Scrapy-Splash renders different HTML

Time: 2018-07-25 13:33:59

Tags: python web-scraping scrapy scrapy-splash

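I have recently been trying to scrape an e-commerce website. At first I kept getting redirected to an "Are you a robot?" page. I then started using a browser user agent, scrapy-splash for JavaScript, and a 5-second download delay. Now there are no errors, but the correct page is not being rendered.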
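A setup like the one described would typically live in the project's settings.py. The following is only a sketch of what such a file might look like, not the asker's actual configuration: the user agent string is an assumed example, the Splash address is the one that appears in the crawl log, and the middleware entries follow the scrapy-splash README.

```python
# settings.py (sketch; values marked "assumed" are not from the post)

# Browser-like user agent; this exact string is an assumed example.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/67.0.3396.99 Safari/537.36"
)

# 5-second delay between requests, as described in the question.
DOWNLOAD_DELAY = 5

# Splash instance; the address matches the one seen in the crawl log.
SPLASH_URL = "http://192.168.99.100:8050"

# Middleware wiring as documented in the scrapy-splash README.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```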

spider.py

import scrapy
from scrapy_splash import SplashRequest
# Product is assumed to be the project's Item class, imported from its items module.

class ClassifiedsSpider(scrapy.Spider):
    name = 'classifieds'
    allowed_domains = ['dubai.dubizzle.com']
    start_urls = ['http://dubai.dubizzle.com/classified/']

    def parse(self, response):
        # Take the first ad link from the listing page and render it through Splash.
        url = response.xpath("//h3[@id='title']/span[@class='title']/a/@href").extract_first()
        yield SplashRequest(url, self.get_details,
            endpoint='render.html',
            args={'wait': 0.5, 'proxy': 'http://89.212.66.36:8080'},
        )

    def get_details(self, response):
        title = response.xpath("//h1[@id='title']/span[@id='listing-title-wrap']").extract_first()
        item = Product()
        item['title'] = title
        yield item
        # Append the rendered HTML to a file for debugging.
        with open("body.txt", "a") as f:
            f.write(response.body.decode("utf-8"))

Rendered HTML

<!DOCTYPE html><html><head>
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">
<meta http-equiv="cache-control" content="max-age=0">
<meta http-equiv="cache-control" content="no-cache">
<meta http-equiv="expires" content="0">
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT">
<meta http-equiv="pragma" content="no-cache">
<meta http-equiv="refresh" content="10; url=/distil_r_captcha.html?requestId=69660faf-3162-4101-9087-7d1dd8930125&amp;httpReferrer=%2Fclassified%2Ffurniture-home-garden%2Ffurniture%2Fsofas-futons-lounges%2F2018%2F7%2F25%2Fpure-leather-sofa-2%2F%3Fback%3DL2NsYXNzaWZpZWQv%26pos%3D1">
<script type="text/javascript">
    (function(window){
        try {
            if (typeof sessionStorage !== 'undefined'){
                sessionStorage.setItem('distil_referrer', document.referrer);
            }
        } catch (e){}
    })(window);
</script>
<script type="text/javascript" src="/dstldbzzlxhr.js" defer=""></script><style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#yuatbtfwcdza{display:none!important}</style></head>
<body>
<div id="distilIdentificationBlock">&nbsp;</div>


<div id="d__fFH" style="position: absolute; top: -5000px; left: -5000px;"><object id="d_dlg" classid="clsid:3050f819-98b5-11cf-bb82-00aa00bdce0b" width="0px" height="0px"></object><span id="d__fF" style="font-family: Courier, serif; font-size: 200px; visibility: hidden;">The quick brown fox jumps over the lazy dog.</span></div></body></html>
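The page above is a bot-detection interstitial rather than the ad itself, which is why the title XPath finds nothing. One way to make that visible in the crawl is to test each response for markers of the block page. This is a minimal sketch: the function name and the choice of markers are illustrative assumptions, with the two substrings copied from the HTML shown above.

```python
# Substrings copied from the block page shown above; the helper name
# and the decision to match on these two markers are assumptions.
DISTIL_MARKERS = ("distilIdentificationBlock", "distil_r_captcha")


def looks_like_distil_block(html: str) -> bool:
    """Return True if the HTML contains any known block-page marker."""
    return any(marker in html for marker in DISTIL_MARKERS)
```

In a callback such as get_details, this could be called on response.text before extracting fields, so that a blocked response is logged or retried instead of yielding an item with an empty title.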

Output:

2018-07-25 17:19:58 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://dubai.dubizzle.com/classified/> from <GET http://dubai.dubizzle.com/classified/>
2018-07-25 17:20:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dubai.dubizzle.com/classified/> (referer: None)
2018-07-25 17:20:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dubai.dubizzle.com/classified/clothing-accessories/mens-accessories/sunglasses/2018/7/25/oakley-juliet-gun-metal-gray-with-addition-2/?back=L2NsYXNzaWZpZWQv&pos=1 via http://192.168.99.100:8050/render.html> (referer: None)
2018-07-25 17:20:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://dubai.dubizzle.com/classified/clothing-accessories/mens-accessories/sunglasses/2018/7/25/oakley-juliet-gun-metal-gray-with-addition-2/?back=L2NsYXNzaWZpZWQv&pos=1>
{'title': None}
2018-07-25 17:20:12 [scrapy.core.engine] INFO: Closing spider (finished)
2018-07-25 17:20:12 [scrapy.extensions.feedexport] INFO: Stored csv feed (1 items) in: data.csv

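I realise the site is somehow detecting that this is a web crawler, but I can't seem to get around it.

Note: this does not happen on the category pages, where I can easily crawl and collect the URL of each ad. The problem appears when I request the individual ad URLs.

1 answer:

Answer 0: (score: 0)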

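Most websites now have a robots.txt file to keep scraping bots out. To get around this, open your settings file and set the ROBOTSTXT_OBEY flag to False.

Or, if you want to check it in the shell, you can specify the flag like this:

scrapy shell -s ROBOTSTXT_OBEY=False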
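To see what the ROBOTSTXT_OBEY flag actually governs, it can help to evaluate robots.txt rules by hand. The sketch below uses Python's standard urllib.robotparser; the robots.txt content, the example.com URLs, and the helper name are all hypothetical and are not taken from the site in the question. When ROBOTSTXT_OBEY is True, Scrapy applies this kind of check and drops disallowed requests; setting it to False skips the check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Note that robots.txt only affects requests the crawler chooses not to send. A response like the one in the question, which is served by the site itself, is outside what this flag controls.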