Unable to extract link text containing "@" in Scrapy

Time: 2017-04-20 08:43:51

Tags: python xpath scrapy

In the Scrapy shell for the URL http://www.apkmirror.com/apk/sony-mobile-communications/foldinghome/foldinghome-1-00-40-release/foldinghome-1-00-40-android-apk-download/, I am trying to extract the developer, app, and version names from the navigation bar:

I tried the following XPath selector:

In [6]: response.xpath('//*[@class="breadcrumbs"]//a/text()').extract()
Out[6]: [u'Sony Mobile Communications', u'1.00.40']

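Note, however, that the app name Folding@Home is missing from the result. I don't understand this, because it appears to be inside an <a> tag (as shown by "Inspect" in Chrome).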

Moreover, for a similar page, http://www.apkmirror.com/apk/oculus-vr/oculus-rooms/oculus-rooms-0-0-2-release/oculus-rooms-0-0-2-android-apk-download/, the same selector works fine:

In [1]: response.xpath('//*[@class="breadcrumbs"]//a/text()').extract()
Out[1]: [u'Oculus VR', u'Oculus Rooms', u'0.0.2']

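I am starting to suspect that this might be some kind of bug in Scrapy, in that text() does not select <a> elements whose text contains the @ symbol. Could that be the case?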

2 Answers:

Answer 0 (score: 1)

Viewing the page with Chrome's "View page source" option instead of "Inspect", I see that the navigation bar for this particular link contains JavaScript:

<nav style="margin-left:16px; margin-right:16px;" class="navbar navbar-default" role="navigation">
<div style="color: #013967 !important;" class="breadcrumbs"><a class="withoutripple" style="color: #013967 !important;" href="/apk/sony-mobile-communications/">Sony Mobile Communications</a> <svg class="icon chevron-icon"><use xlink:href="#apkm-icon-chevron"></use></svg> <a class="withoutripple " style="color: #013967 !important;" href="/apk/sony-mobile-communications/foldinghome/"><span class="__cf_email__" data-cfemail="c781a8aba3aea9a0878fa8aaa2">[email&#160;protected]</span><script data-cfhash='f9e31' type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script></a> <svg class="icon chevron-icon"><use xlink:href="#apkm-icon-chevron"></use></svg> <a class="active withoutripple" style="color: #013967 !important;" href="/apk/sony-mobile-communications/foldinghome/foldinghome-1-00-40-release/">1.00.40</a> </nav>

whereas for the Oculus Rooms page from the second example, it does not:

<nav style="margin-left:16px; margin-right:16px;" class="navbar navbar-default" role="navigation">
<div style="color: #646464 !important;" class="breadcrumbs"><a class="withoutripple" style="color: #646464 !important;" href="/apk/oculus-vr/">Oculus VR</a> <svg class="icon chevron-icon"><use xlink:href="#apkm-icon-chevron"></use></svg> <a class="withoutripple " style="color: #646464 !important;" href="/apk/oculus-vr/oculus-rooms/">Oculus Rooms</a> <svg class="icon chevron-icon"><use xlink:href="#apkm-icon-chevron"></use></svg> <a class="active withoutripple" style="color: #646464 !important;" href="/apk/oculus-vr/oculus-rooms/oculus-rooms-0-0-2-release/">0.0.2</a> </nav>

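Scrapy does not execute JavaScript, so the <span> is never replaced by the real text as it is in the browser. Since //a/text() selects only text nodes that are direct children of <a>, and this link's only children are the <span> and the <script>, it contributes nothing to the result. Handling JavaScript with Scrapy is a known issue (see https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/).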
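The inline script in the page source above appears to do something simple: it reads the hex string in data-cfemail, treats the first byte as an XOR key, XORs every following byte with it, and decodes the result. A minimal Python sketch of that same logic follows; the function name decode_cfemail is my own, and the sketch assumes the encoding stays exactly as in the script shown here.

```python
def decode_cfemail(encoded):
    """Decode a data-cfemail hex string the way the inline script does.

    Hypothetical helper: the name and signature are assumptions,
    the algorithm is a port of the script in the page source.
    """
    key = int(encoded[:2], 16)  # first byte is the XOR key
    decoded = bytes(
        int(encoded[i:i + 2], 16) ^ key  # XOR each remaining byte with the key
        for i in range(2, len(encoded), 2)
    )
    # The script percent-decodes the bytes, i.e. reads them as UTF-8
    return decoded.decode("utf-8")


# Value taken from the breadcrumb <span> in the page source above
print(decode_cfemail("c781a8aba3aea9a0878fa8aaa2"))  # Folding@Home
```

In a spider, the attribute could be read with an XPath such as .//@data-cfemail on the link and passed to this function, which avoids rendering JavaScript for this one case.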

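Answer 1 (score: 1)

As you have already figured out, one of the breadcrumb links is "protected" and is built dynamically by JavaScript executed in the browser.

A simple way to solve the problem is to render the page with Splash through the scrapy-splash middleware. This worked for me: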

import scrapy
from scrapy_splash import SplashRequest


class ApkSpider(scrapy.Spider):
    name = "apkmirror"
    allowed_domains = ['apkmirror.com']

    def start_requests(self):
        yield SplashRequest(
            'http://www.apkmirror.com/apk/sony-mobile-communications/foldinghome/foldinghome-1-00-40-release/foldinghome-1-00-40-android-apk-download/',
            self.parse_result,
            )

    def parse_result(self, response):
        print(response.xpath('//*[@class="breadcrumbs"]//a/text()').extract())

With the following settings:

SPLASH_URL = 'http://127.0.0.1:8050'
SPLASH_COOKIES_DEBUG = True

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

Splash runs in a Docker container on port 8050.

Prints:

[u'Sony Mobile Communications', u'Folding@Home', u'1.00.40']
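If running Splash is not an option, the same three names can probably be recovered without executing any JavaScript, by decoding the data-cfemail attribute directly. The sketch below uses only the standard library so that it runs on its own; LinkTextParser and decode_cfemail are hypothetical names, the sample markup is an abbreviated copy of the breadcrumb (attributes trimmed, script body omitted), and it assumes the obfuscation scheme is the XOR one used by the inline script in the page source.

```python
from html.parser import HTMLParser


def decode_cfemail(encoded):
    # First byte is the XOR key; the remaining bytes are the obfuscated text
    raw = bytes.fromhex(encoded)
    return bytes(b ^ raw[0] for b in raw[1:]).decode("utf-8")


class LinkTextParser(HTMLParser):
    """Collect the text of each <a>, decoding data-cfemail placeholders."""

    def __init__(self):
        super().__init__()
        self.links = []       # one string per <a> element
        self._in_link = False
        self._skip = None     # tag whose text content is being ignored

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._in_link = True
            self.links.append("")
        elif self._in_link and tag == "script":
            self._skip = "script"  # never treat script source as link text
        elif self._in_link and "data-cfemail" in attrs:
            self.links[-1] += decode_cfemail(attrs["data-cfemail"])
            self._skip = tag       # ignore the "[email protected]" placeholder

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif tag == self._skip:
            self._skip = None

    def handle_data(self, data):
        if self._in_link and self._skip is None:
            self.links[-1] += data


# Abbreviated breadcrumb markup (an assumption: trimmed from the real page)
sample = (
    '<div class="breadcrumbs">'
    '<a href="/apk/sony-mobile-communications/">Sony Mobile Communications</a> '
    '<a href="/apk/sony-mobile-communications/foldinghome/">'
    '<span class="__cf_email__" data-cfemail="c781a8aba3aea9a0878fa8aaa2">'
    '[email&#160;protected]</span>'
    '<script type="text/javascript">/* decoding script omitted */</script></a> '
    '<a href="/apk/sony-mobile-communications/foldinghome/'
    'foldinghome-1-00-40-release/">1.00.40</a></div>'
)

parser = LinkTextParser()
parser.feed(sample)
parser.close()
print(parser.links)
```

This trades a rendering service for a dependency on one specific obfuscation scheme, so it is lighter to run but would need updating if the site changes how it protects the text.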