Question

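This is the URL I tried. I am trying to get the body text of the article, the part beginning "Co-viewing in television ...". I have tried the following expressions: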

[In 1]:response.xpath("//*[contains(@class, 'text parbase')]//text()").extract()
[Out 1]:[]

[In 2]:response.xpath("//*[contains(@class, 'text')]//text()").extract()
[Out 2]: [u'\n',
 u'\n',
 u'\n\n',
 u'\n    $CQ(function() {\n        CQ_Analytics.SegmentMgr.loadSegments("/etc/segmentation");\n         CQ_Analytics.ClientContextUtils.init("","/content/corporate/us/en/insights/journal-of-measurement/volume-1-issue-2/nott-alone-is-ott-making-it-cool-again-to-watch-tv-together");\n\n        \n    });\n',
 u'\n']

[In 3]:response.xpath("//p//text()").extract()
[Out 3]:[u'X']

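None of them seem to contain what I want. Am I doing something wrong here? I apologize if this has already been answered; I did my best to search for an answer but have not found one. Any help would be greatly appreciated. Thanks!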

Answer 1

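There seems to be some problem with the site's HTML output, and Scrapy's parser cannot handle that section. As a workaround, you can extract the content with a regular expression: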

import re
from scrapy import Selector

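# Use response.text (str): on Python 3, response.body is bytes and cannot be matched with a str pattern.
section = re.match(r'.*(<div.*?parbase toptext.*?)</div>', response.text, re.DOTALL).group(1)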
Selector(text=section).xpath('//text()').extract()

Answer 2

From what I can see, the page contains the following line:

<li><script src="https://apis.google.com/js/platform.js" asyncdefer=[NULL][NULL]

where [NULL] stands for a null byte.

This seems to throw off the parser. If I construct a selector from the response body with the null bytes removed, it works.
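
The null-byte workaround could look roughly like the sketch below. This is a minimal illustration, not the answerer's actual code: the helper names `strip_null_bytes` and `extract_paragraph_text` are hypothetical, and UTF-8 is an assumed default encoding (in a real spider you would likely pass `response.encoding`).

```python
def strip_null_bytes(body: bytes, encoding: str = "utf-8") -> str:
    """Decode a raw response body after removing every NUL (0x00) byte."""
    # Assumption: the stray null bytes carry no content and can simply be dropped.
    return body.replace(b"\x00", b"").decode(encoding, errors="replace")


def extract_paragraph_text(body: bytes, encoding: str = "utf-8") -> list:
    """Build a fresh Selector from the cleaned body and return all <p> text nodes."""
    # Imported lazily so strip_null_bytes stays usable without Scrapy installed.
    from scrapy import Selector

    cleaned = strip_null_bytes(body, encoding)
    return Selector(text=cleaned).xpath("//p//text()").extract()
```

In the Scrapy shell this would be called as `extract_paragraph_text(response.body, response.encoding)`, reusing the `//p//text()` expression from the question.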

Scrapy XPath not working (maybe something to do with parbase?)
