In the Scrapy shell for the page http://www.apkmirror.com/apk/google-inc/sheets/sheets-1-7-152-06-release/google-sheets-1-7-152-06-30-android-apk-download/, I am trying to parse the version_name (1.7.152.06.30) and the version_code (71520630) in a concise way using Scrapy's MapCompose processor.
My first step is to get all the text from the 'APK details' section:
In [2]: apk_details = response.xpath('//*[@title="APK details"]/following-sibling::*[@class="appspec-value"]//text()').extract()
The resulting apk_details list looks like this:
[u'Version: 1.7.152.06.30 (71520630)',
u'arm ',
u'Package: com.google.android.apps.docs.editors.sheets',
u'\n',
u'191 downloads ']
I have defined the following helper functions:
import re

def get_version_line(apk_details):
    '''Get the line containing the version from the 'APK details' section.'''
    return next(line for line in apk_details if line.startswith("Version:"))

def parse_version_line(version_line):
    '''Parse the 'versionName' and 'versionCode' from the relevant line in 'APK details'.'''
    # Note that the pattern includes the end-of-line anchor ($). This is necessary
    # because some package names (e.g. Google Play) themselves contain brackets.
    PATTERN = r"^Version: (?P<version_name>.+) \((?P<version_code>\d+)\)\s*$"
    return re.match(PATTERN, version_line).groupdict()
With these, version_name can be obtained as follows:
In [4]: version_line = get_version_line(apk_details)
In [5]: version_line
Out[5]: u'Version: 1.7.152.06.30 (71520630)'
In [6]: groups = parse_version_line(version_line)
In [7]: groups
Out[7]: {'version_code': u'71520630', 'version_name': u'1.7.152.06.30'}
In [8]: version_name = groups.get("version_name")
In [9]: version_name
Out[9]: u'1.7.152.06.30'
In other words, I would like to apply get_version_line, parse_version_line, and lambda d: d.get("version_name") to apk_details in succession. However, if I try the following:
In [10]: proc = MapCompose(get_version_line, parse_version_line)
In [11]: proc(apk_details)
I get a StopIteration exception:
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
<ipython-input-11-59a0bd60721d> in <module>()
----> 1 proc(apk_details)
/usr/local/lib/python2.7/dist-packages/scrapy/loader/processors.pyc in __call__(self, value, loader_context)
26 next_values = []
27 for v in values:
---> 28 next_values += arg_to_iter(func(v))
29 values = next_values
30 return values
/home/kurt/dev/apkmirror_scraper/apkmirror_scraper/items.pyc in get_version_line(apk_details)
33 def get_version_line(apk_details):
34 '''Get the line containing the version from the 'APK details' section.'''
---> 35 return next(line for line in apk_details if line.startswith("Version:"))
36
37 def get_architectures_line(apk_details):
StopIteration:
How can I use MapCompose correctly in this case?
Answer 0 (score: 1)
Use Compose instead, together with the re.MULTILINE flag:
import re
from scrapy.loader.processors import Compose

def parse_version_line(version_line):
    """Parse the 'versionName' and 'versionCode' from the relevant line in 'APK details'."""
    # Compose passes the whole list in, so join it into one multi-line string.
    text = '\n'.join(version_line)
    # Note that the pattern includes the end-of-line anchor ($). This is necessary
    # because some package names (e.g. Google Play) themselves contain brackets.
    PATTERN = r"^Version: (?P<version_name>.+) \((?P<version_code>\d+)\)\s*$"
    # re.search (rather than re.match) also finds the line when it is not the first one.
    return re.search(PATTERN, text, re.MULTILINE).groupdict()
Trying it out:
data = [u'Version: 1.7.152.06.30 (71520630)',
u'arm ',
u'Package: com.google.android.apps.docs.editors.sheets',
u'\n',
u'191 downloads ']
m = Compose(parse_version_line)
print(m(data))
# {'version_name': u'1.7.152.06.30', 'version_code': u'71520630'}
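The role of re.MULTILINE on the joined text can be checked with the standard re module alone, without Scrapy. The following is a minimal sketch under stated assumptions: the list is the one from the question, while the reordered string is a hypothetical input that does not come from the page. It also compares re.match with re.search for the case where the Version line is not the first line.

```python
import re

PATTERN = r"^Version: (?P<version_name>.+) \((?P<version_code>\d+)\)\s*$"

lines = [
    'Version: 1.7.152.06.30 (71520630)',
    'arm ',
    'Package: com.google.android.apps.docs.editors.sheets',
    '\n',
    '191 downloads ',
]
text = '\n'.join(lines)

# Without re.MULTILINE, $ only matches at the very end of the string,
# and '.' cannot cross a newline, so the pattern does not match.
without_flag = re.match(PATTERN, text)

# With re.MULTILINE, $ also matches at the end of each line.
with_flag = re.match(PATTERN, text, re.MULTILINE)

# Hypothetical ordering in which the Version line is not first.
reordered = '\n'.join(['arm ', lines[0]])

# re.match only tries the start of the string, even with re.MULTILINE.
match_reordered = re.match(PATTERN, reordered, re.MULTILINE)

# re.search scans forward, and ^ can match at the start of a later line.
search_reordered = re.search(PATTERN, reordered, re.MULTILINE)

print(without_flag)
print(with_flag.groupdict())
print(match_reordered)
print(search_reordered.groupdict())
```

In this sketch the flag is what lets the pattern match at all on the joined text, and re.search is the more forgiving choice if the order of the extracted lines is not guaranteed.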
Answer 1 (score: 0)
Based on Granitosaurus's answer, I found that the solution is simply to use Scrapy's Compose processor instead of MapCompose:
In [26]: proc = Compose(get_version_line, parse_version_line, lambda d: d.get("version_name"))
In [27]: print proc(apk_details)
1.7.152.06.30
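After reading the documentation more carefully, this makes sense: broadly speaking, Compose applies the composition of the given functions to the input as a whole, whereas MapCompose applies the composed functions to each element of the input.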
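The difference can be illustrated without Scrapy installed. The sketch below uses two simplified stand-ins, compose and map_compose. Both names are hypothetical and this is not Scrapy's implementation: details such as loader contexts and the handling of None values are left out. It only mirrors the loop visible in the traceback from the question.

```python
def compose(*functions):
    """Simplified stand-in for Compose: chain functions over the whole input."""
    def processor(value):
        for func in functions:
            value = func(value)
        return value
    return processor


def map_compose(*functions):
    """Simplified stand-in for MapCompose: apply each function to every element."""
    def processor(values):
        for func in functions:
            next_values = []
            for v in values:
                result = func(v)
                # Assumption: list results are flattened, anything else is wrapped.
                next_values += result if isinstance(result, list) else [result]
            values = next_values
        return values
    return processor


def get_version_line(apk_details):
    return next(line for line in apk_details if line.startswith("Version:"))


apk_details = [
    'Version: 1.7.152.06.30 (71520630)',
    'arm ',
    'Package: com.google.android.apps.docs.editors.sheets',
    '\n',
    '191 downloads ',
]

# Whole-input composition: get_version_line receives the list.
first_line = compose(get_version_line)(apk_details)

# Per-element application: get_version_line receives one string at a time,
# so the generator iterates over single characters and finds nothing.
per_element_failed = False
try:
    map_compose(get_version_line)(apk_details)
except StopIteration:
    per_element_failed = True

# A function that makes sense per element works with the mapping variant.
stripped = map_compose(str.strip)(apk_details)

print(first_line)
print(per_element_failed)
print(stripped)
```

Here the per-element call fails because iterating over a single string yields its characters, none of which starts with "Version:". A per-element function such as str.strip is the kind of step the mapping variant is suited to.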