How to use Scrapy's MapCompose input processor

Time: 2017-04-25 11:07:33

Tags: python scrapy

In a Scrapy shell for the page http://www.apkmirror.com/apk/google-inc/sheets/sheets-1-7-152-06-release/google-sheets-1-7-152-06-30-android-apk-download/, I am trying to parse the version_name (1.7.152.06.30) and the version_code (71520630) in a concise way using Scrapy's MapCompose processor.

My first step is to get all the text from the 'APK details' section:

In [2]: apk_details = response.xpath('//*[@title="APK details"]/following-sibling::*[@class="appspec-value"]//text()').extract()

The apk_details list looks like this:

[u'Version: 1.7.152.06.30 (71520630)',
 u'arm ',
 u'Package: com.google.android.apps.docs.editors.sheets',
 u'\n',
 u'191 downloads ']

I have defined the following helper functions:

import re

def get_version_line(apk_details):
    '''Get the line containing the version from the 'APK details' section.'''
    return next(line for line in apk_details if line.startswith("Version:"))

def parse_version_line(version_line):
    '''Parse the 'versionName' and 'versionCode' from the relevant line in 'APK details'.'''
    PATTERN = r"^Version: (?P<version_name>.+) \((?P<version_code>\d+)\)\s*$"       # Note that the pattern includes the end-of-line character ($). This is necessary because some package names (e.g. Google Play) themselves contain brackets.
    return re.match(PATTERN, version_line).groupdict()

With these, the version_name can be obtained as follows:

In [4]: version_line = get_version_line(apk_details)

In [5]: version_line
Out[5]: u'Version: 1.7.152.06.30 (71520630)'

In [6]: groups = parse_version_line(version_line)

In [7]: groups
Out[7]: {'version_code': u'71520630', 'version_name': u'1.7.152.06.30'}

In [8]: version_name = groups.get("version_name")

In [9]: version_name
Out[9]: u'1.7.152.06.30'

In other words, I would like to apply get_version_line, parse_version_line, and lambda d: d.get("version_name") in succession to apk_details. However, if I try the following:

In [10]: proc = MapCompose(get_version_line, parse_version_line)

In [11]: proc(apk_details)

I get a StopIteration exception:

---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-11-59a0bd60721d> in <module>()
----> 1 proc(apk_details)

/usr/local/lib/python2.7/dist-packages/scrapy/loader/processors.pyc in __call__(self, value, loader_context)
     26             next_values = []
     27             for v in values:
---> 28                 next_values += arg_to_iter(func(v))
     29             values = next_values
     30         return values

/home/kurt/dev/apkmirror_scraper/apkmirror_scraper/items.pyc in get_version_line(apk_details)
     33 def get_version_line(apk_details):
     34     '''Get the line containing the version from the 'APK details' section.'''
---> 35     return next(line for line in apk_details if line.startswith("Version:"))
     36 
     37 def get_architectures_line(apk_details):

StopIteration:

How do I use MapCompose correctly in this case?

2 Answers:

Answer 0 (score: 1)

Use Compose instead, together with the re.MULTILINE flag:

import re
from scrapy.loader.processors import Compose


def parse_version_line(version_line):
    """Parse the 'versionName' and 'versionCode' from the relevant line in 'APK details'."""
    text = '\n'.join(version_line)
    PATTERN = r"^Version: (?P<version_name>.+) \((?P<version_code>\d+)\)\s*$"  # Note that the pattern includes the end-of-line character ($). This is necessary because some package names (e.g. Google Play) themselves contain brackets.
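    # re.search rather than re.match: re.match only tries the very start of the
    # joined text, so it would fail if the version line were not the first element.
    return re.search(PATTERN, text, re.MULTILINE).groupdict()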

Trying it out:

data = [u'Version: 1.7.152.06.30 (71520630)',
        u'arm ',
        u'Package: com.google.android.apps.docs.editors.sheets',
        u'\n',
        u'191 downloads ']
m = Compose(parse_version_line)
print(m(data))
# {'version_name': u'1.7.152.06.30', 'version_code': u'71520630'}
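
One detail about the regex call is worth checking: re.match only attempts a match at the very beginning of the string, even when re.MULTILINE is set, whereas re.search tries every line start. The sketch below uses only the standard library and a hypothetical reordering of the list (the version line moved to second place, which is an assumption and not what the page returned) to illustrate the difference.

```python
import re

PATTERN = r"^Version: (?P<version_name>.+) \((?P<version_code>\d+)\)\s*$"

# Hypothetical ordering for illustration: the version line is NOT first.
reordered = [
    'arm ',
    'Version: 1.7.152.06.30 (71520630)',
    'Package: com.google.android.apps.docs.editors.sheets',
]
text = '\n'.join(reordered)

# re.match is anchored at position 0 of the whole string, so it finds nothing.
match_result = re.match(PATTERN, text, re.MULTILINE)

# re.search scans forward; with re.MULTILINE, ^ and $ match at each line boundary.
search_result = re.search(PATTERN, text, re.MULTILINE)

print(match_result)
print(search_result.groupdict())
```

With the list in the order shown in the question, both calls find the version line, so the distinction only matters if the order of the extracted text nodes changes.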

Answer 1 (score: 0)

Based on Granitosaurus' answer, I found that the solution is simply to use Scrapy's Compose processor instead of MapCompose:

In [26]: proc = Compose(get_version_line, parse_version_line, lambda d: d.get("version_name"))

In [27]: print proc(apk_details)
1.7.152.06.30

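After reading the documentation more carefully, this makes sense: broadly speaking, Compose applies the composition of the given functions to the input as a whole, whereas MapCompose applies the composed functions to each element of the input.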
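
To make the difference concrete without needing Scrapy installed, here is a minimal sketch with two simplified stand-ins, compose and map_compose. They are hypothetical helpers written for illustration, not Scrapy's actual classes, and they omit features such as loader_context. The sketch shows why get_version_line raises StopIteration when applied per element (each element is a single string, so the generator iterates over its characters), and how a per-element function such as the hypothetical version_name_or_none would fit the MapCompose model, relying on the documented behavior that None results are dropped.

```python
import re

VERSION_RE = re.compile(r"^Version: (?P<version_name>.+) \((?P<version_code>\d+)\)\s*$")


def compose(*functions):
    """Simplified stand-in for Compose: each function receives the whole value."""
    def processor(value):
        for func in functions:
            value = func(value)
        return value
    return processor


def map_compose(*functions):
    """Simplified stand-in for MapCompose: each function is applied per element."""
    def processor(values):
        for func in functions:
            next_values = []
            for value in values:
                result = func(value)
                if result is None:
                    continue  # None results are ignored
                if isinstance(result, (list, tuple)):
                    next_values.extend(result)  # list results are flattened
                else:
                    next_values.append(result)
            values = next_values
        return values
    return processor


def get_version_line(apk_details):
    """Expects the whole list of lines, as in the question."""
    return next(line for line in apk_details if line.startswith("Version:"))


def version_name_or_none(line):
    """Hypothetical per-element helper: works on ONE line at a time."""
    match = VERSION_RE.match(line)
    return match.group("version_name") if match else None


apk_details = [
    'Version: 1.7.152.06.30 (71520630)',
    'arm ',
    'Package: com.google.android.apps.docs.editors.sheets',
    '\n',
    '191 downloads ',
]

# Whole-input composition: get_version_line sees the full list.
whole_result = compose(get_version_line)(apk_details)

# Per-element mapping: get_version_line sees one string and iterates its characters.
try:
    map_compose(get_version_line)(apk_details)
    per_element_error = None
except StopIteration:
    per_element_error = "StopIteration"

# A function designed for single elements works with the per-element model.
per_element_result = map_compose(version_name_or_none)(apk_details)

print(whole_result)
print(per_element_error)
print(per_element_result)
```

Note that the per-element variant returns a list, so in an item loader it would typically be paired with an output processor such as TakeFirst to obtain a single value.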