I am refactoring a spider I wrote to scrape APK pages such as http://www.apkmirror.com/apk/adobe/photoshop-mix/photoshop-mix-1-0-333-release/adobe-photoshop-mix-1-0-333-beta-android-apk-download/. This is the spider so far:
DEBUG = True

import scrapy
from scrapy.spiders import SitemapSpider
from apkmirror_scraper.items import ApkmirrorScraperItem, ApkmirrorItemLoader


class ApkmirrorSitemapSpider(SitemapSpider):
    name = 'apkmirror-spider'
    sitemap_urls = ['http://www.apkmirror.com/sitemap_index.xml']
    sitemap_rules = [(r'.*-android-apk-download/$', 'parse')]

    if DEBUG:
        custom_settings = {'CLOSESPIDER_PAGECOUNT': 20,
                           'CLOSESPIDER_ERRORCOUNT': 0,
                           'CONCURRENT_REQUESTS': 16,
                           'CONCURRENT_REQUESTS_PER_DOMAIN': 8}

    def parse(self, response):
        l = ApkmirrorItemLoader(item=ApkmirrorScraperItem(), response=response)
        l.add_value('url', response.url)
        l.add_xpath(field_name='title', xpath='//h1[@title]/text()')
        l.add_xpath(field_name='developer', xpath='//h3[@title]/a/text()')
        l.add_xpath(field_name='app', xpath='//*[contains(@data-channel-name, "App Updates")]/@data-channel-name')
        return l.load_item()
I am trying to move the processing and parsing of the item fields into items.py:
import re
import scrapy
import scrapy.loader
from scrapy.loader.processors import MapCompose, TakeFirst


class ApkmirrorScraperItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    developer = scrapy.Field()
    app = scrapy.Field()


def parse_app(data_channel_name):
    '''Parse the name of the app from the "data-channel-name" attribute of the button named "Follow [app_name] Updates".'''
    pattern = re.compile(r'(?P<app>.+) App Updates')
    return pattern.search(data_channel_name).groupdict().get("app")


class ApkmirrorItemLoader(scrapy.loader.ItemLoader):
    url_out = TakeFirst()
    title_in = MapCompose(unicode.strip)
    title_out = TakeFirst()
    developer_in = MapCompose(unicode.strip)
    developer_out = TakeFirst()
    app_out = MapCompose(parse_app)
Currently, if I run the spider, it scrapes items like this:
2017-04-24 19:30:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.apkmirror.com/apk/adobe/photoshop-mix/photoshop-mix-1-0-333-release/adobe-photoshop-mix-1-0-333-beta-android-apk-download/> (referer: http://www.apkmirror.com/apps_post-sitemap5.xml)
2017-04-24 19:30:57 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.apkmirror.com/apk/adobe/photoshop-mix/photoshop-mix-1-0-333-release/adobe-photoshop-mix-1-0-333-beta-android-apk-download/>
{'app': [u'Adobe Photoshop Mix'],
'developer': u'Adobe',
'title': u'Adobe Photoshop Mix 1.0.333 beta (arm)',
'url': 'http://www.apkmirror.com/apk/adobe/photoshop-mix/photoshop-mix-1-0-333-release/adobe-photoshop-mix-1-0-333-beta-android-apk-download/'}
Note that the 'app' field is still a list, and I would still like to apply Scrapy's TakeFirst() processor to it. However, if I try changing the relevant line to
app_out = MapCompose(parse_app, TakeFirst())
I get scraped items like the following:
2017-04-24 19:44:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.apkmirror.com/apk/microsoft-corporation/powerpoint/powerpoint-16-0-6228-1008-release/powerpoint-16-0-6228-1008-android-apk-download/> (referer: http://www.apkmirror.com/apps_post-sitemap12.xml)
2017-04-24 19:44:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.apkmirror.com/apk/microsoft-corporation/powerpoint/powerpoint-16-0-6228-1008-release/powerpoint-16-0-6228-1008-android-apk-download/>
{'app': [u'M'],
'developer': u'Microsoft Corporation',
'title': u'Microsoft PowerPoint 16.0.6228.1008 (arm)',
'url': 'http://www.apkmirror.com/apk/microsoft-corporation/powerpoint/powerpoint-16-0-6228-1008-release/powerpoint-16-0-6228-1008-android-apk-download/'}
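where app is 'M' instead of 'Microsoft PowerPoint'. In other words, it seems that TakeFirst() is taking the first character of the string rather than the first element of the list. If I try switching the order to MapCompose(TakeFirst(), parse_app), I get an error like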
2017-04-24 19:49:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.apkmirror.com/apk/google-inc/google/google-6-8-0-107974459-release/google-6-8-0-107974459-android-4-0-3-android-apk-download/> (referer: http://www.apkmirror.com/apps_post-sitemap13.xml)
2017-04-24 19:49:15 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.apkmirror.com/apk/google-inc/google/google-6-8-0-107974459-release/google-6-8-0-107974459-android-4-0-3-android-apk-download/> (referer: http://www.apkmirror.com/apps_post-sitemap13.xml)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/kurt/dev/apkmirror_scraper/apkmirror_scraper/spiders/sitemap_spider.py", line 43, in parse
    return l.load_item()
  File "/usr/local/lib/python2.7/dist-packages/scrapy/loader/__init__.py", line 115, in load_item
    value = self.get_output_value(field_name)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/loader/__init__.py", line 128, in get_output_value
    (field_name, self._values[field_name], type(e).__name__, str(e)))
ValueError: Error with output processor: field='app' value=[u'Google+ App Updates'] error='AttributeError: 'NoneType' object has no attribute 'groupdict''
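In other words, the parse_app function fails because the regular expression no longer matches its input.
How can I incorporate TakeFirst() into the ItemLoader?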
Answer 0 (score: 0)
I managed to get the desired result by using the custom parsing function as the input processor and TakeFirst() as the output processor:
app_in = MapCompose(parse_app)
app_out = TakeFirst()
The scraped items now look like
2017-04-24 19:55:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.apkmirror.com/apk/microsoft-corporation/excel/excel-16-0-6228-1008-release/excel-16-0-6228-1008-android-apk-download/> (referer: http://www.apkmirror.com/apps_post-sitemap12.xml)
2017-04-24 19:55:12 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.apkmirror.com/apk/microsoft-corporation/excel/excel-16-0-6228-1008-release/excel-16-0-6228-1008-android-apk-download/>
{'app': u'Microsoft Excel',
'developer': u'Microsoft Corporation',
'title': u'Microsoft Excel 16.0.6228.1008 (arm)',
'url': 'http://www.apkmirror.com/apk/microsoft-corporation/excel/excel-16-0-6228-1008-release/excel-16-0-6228-1008-android-apk-download/'}
with the app's full name as a plain string rather than a list.
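Here is a rough sketch of why the order matters. MapCompose calls each function once per element of the list, while TakeFirst expects to receive the whole list. The `take_first` and `map_compose` helpers below are simplified stand-ins written for illustration, not Scrapy's actual implementation (for instance, they do not wrap errors in a ValueError the way the loader does), and the code is written for Python 3 so it runs without Scrapy installed.

```python
import re


def take_first(values):
    """Stand-in for TakeFirst: return the first value that is neither None nor ''."""
    for value in values:
        if value is not None and value != '':
            return value
    return None


def map_compose(*functions):
    """Stand-in for MapCompose: call each function on every element in turn,
    flattening list results and dropping None."""
    def process(values):
        for function in functions:
            results = []
            for value in values:
                result = function(value)
                if result is None:
                    continue
                if isinstance(result, (list, tuple)):
                    results.extend(result)
                else:
                    results.append(result)
            values = results
        return values
    return process


def parse_app(data_channel_name):
    """Same regex as in the question; raises AttributeError when nothing matches."""
    pattern = re.compile(r'(?P<app>.+) App Updates')
    return pattern.search(data_channel_name).groupdict().get('app')


raw = ['Microsoft PowerPoint App Updates']

# MapCompose(parse_app, TakeFirst()): take_first receives each parsed *string*,
# iterates over its characters, and returns the first one.
broken = map_compose(parse_app, take_first)(raw)

# MapCompose(TakeFirst(), parse_app): parse_app receives a single character,
# the regex does not match, and .groupdict() is called on None.
try:
    map_compose(take_first, parse_app)(raw)
    reversed_error = None
except AttributeError as error:
    reversed_error = type(error).__name__

# Input processor MapCompose(parse_app), output processor TakeFirst():
# take_first now receives the whole list of parsed names.
fixed = take_first(map_compose(parse_app)(raw))

print(broken)          # ['M']
print(reversed_error)  # AttributeError
print(fixed)           # Microsoft PowerPoint
```

The three cases correspond to the 'M' result, the 'NoneType' object has no attribute 'groupdict' error, and the working input/output split. Under these assumptions, the per-element work (parsing) belongs in MapCompose and the list-level reduction (TakeFirst) belongs in the output processor.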