XPath: selecting text that contains a specific string

Time: 2017-04-25 16:05:26

Tags: python xpath scrapy

On the page http://www.apkmirror.com/apk/redditinc/reddit/reddit-1-5-5-release/reddit-1-5-5-android-apk-download/, I am trying to extract the lines containing the Android "Min:" and "Target:" versions.

In the Scrapy shell, I have so far come up with this XPath expression:

In [1]: android_version = response.xpath('//*[@title="Android version"]/following-sibling::*[@class="appspec-value"]')

If I chain this with .//text() and extract(), I get several lines, including the ones I want:

In [2]: android_version_text = android_version.xpath('.//text()').extract()

In [3]: android_version_text
Out[3]: 
[u'\n',
 u'Min: Android 4.0.3 (Ice Cream Sandwich MR1, API 15) ',
 u'\n',
 u'Target: Android 6.0 (Marshmallow, API 23)',
 u'\n']

I now want to refine the XPath expression so that it returns only the fields whose text() contains "Min:" or "Target:". Following "XPath contains(text(),'some string') doesn't work when used with node with more than one Text subnode", I tried

In [7]: android_version.xpath('.//*[contains(text(), "Min:"]')

but this produces a

ValueError: XPath error: Invalid expression in .//*[contains(text(), "Min:"]

How can I construct an XPath expression that returns only the Min: line, for example?

1 Answer:

Answer 0: (score: 0)

Following https://blog.scrapinghub.com/2014/07/17/xpath-tips-from-the-web-scraping-trenches/, I came up with the following:

In [12]: android_min_version = response.xpath('//*[@title="Android version"]/following-sibling::*[@class="appspec-value"]//text()[starts-with(., "Min:")]')

In [13]: android_min_version.extract()
Out[13]: [u'Min: Android 4.0.3 (Ice Cream Sandwich MR1, API 15) ']

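In short, to filter for the text you want, use a plain //text() followed by [contains(., "target_string")], where "target_string" is the string you are searching for. (Here I used starts-with instead of contains, so that only text nodes beginning with the string match.)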
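
The filter described above can be tried outside Scrapy with a minimal sketch that uses lxml, the library Scrapy's selectors are built on. The HTML in SAMPLE is hypothetical, reconstructed from the output shown in the question, so the real page's markup may differ. The last query is the question's failed attempt with the closing parenthesis of contains() restored; the unbalanced parenthesis appears to be what triggered the ValueError.

```python
from lxml import html

# Hypothetical markup reconstructed from the output in the question;
# the structure of the real page may differ.
SAMPLE = """
<div>
  <span title="Android version"></span>
  <div class="appspec-value">
<div>Min: Android 4.0.3 (Ice Cream Sandwich MR1, API 15) </div>
<div>Target: Android 6.0 (Marshmallow, API 23)</div>
</div>
</div>
"""

tree = html.fromstring(SAMPLE)

# Same base expression as in the question.
BASE = '//*[@title="Android version"]/following-sibling::*[@class="appspec-value"]'

# Select text nodes directly, then filter each node on its own string value (".").
min_lines = tree.xpath(BASE + '//text()[starts-with(., "Min:")]')
target_lines = tree.xpath(BASE + '//text()[contains(., "Target:")]')

# Both lines in one query, combining the predicates with "or".
both_lines = tree.xpath(
    BASE + '//text()[starts-with(., "Min:") or starts-with(., "Target:")]'
)

# The question's attempt with balanced parentheses. It selects elements rather
# than text nodes, and contains(text(), ...) only inspects the first text child.
min_elements = tree.xpath(BASE + '//*[contains(text(), "Min:")]')

print([s.strip() for s in min_lines])
print([s.strip() for s in target_lines])
print(len(both_lines), len(min_elements))
```

In the Scrapy shell, the same predicates can be appended to the expression passed to response.xpath(), followed by .extract() as in the answer.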