Question

Hi, I'm a bit of a newbie with Scrapy. I'm trying to scrape articles (content, agency name, correspondent, etc.) from the following page: http://timesofindia.indiatimes.com/topic/Startup

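The problem is that my spider returns correct results for most articles, but for articles whose agency name is "Reuters" (for example, http://timesofindia.indiatimes.com/business/international-business/novartis-roche-back-french-gene-therapy-start-up-vivet/articleshow/58511702.cms), it returns only a bunch of escape characters instead of the content (it does return the headline and the agency name). Here are my XPath expressions: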

main_path = response.xpath('//div[@class="main-content"]')

yield {
    'Headline': "".join(main_path.xpath('.//h1[@class="heading1"]/text()').extract()),
    'Correspondent': "".join(main_path.xpath('.//span[@class="auth_detail"]/text()').extract()),
    'Agency': "".join(main_path.xpath('.//span[@itemprop="name"]/text()').extract()),
    'ArticleContent': main_path.xpath('.//div[@class="Normal"]/text()').extract(),
}

Can you help me figure out why I'm running into this problem? Thanks.

Answer 1

Solution: insert a second / before text() in the XPath:

'ArticleContent':(main_path.xpath('.//div[@class="Normal"]//text()').extract()),
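The //text() selector returns a list of text fragments, and some of them may be whitespace only (such as "\n"). A minimal sketch of one way to tidy that list into a single string follows. The helper name clean_fragments and the sample list are hypothetical, made up for illustration and not taken from the real page.

```python
def clean_fragments(fragments):
    """Join text fragments into one string, dropping whitespace-only pieces."""
    # Strip surrounding whitespace from every fragment first.
    stripped = (fragment.strip() for fragment in fragments)
    # Keep only fragments that still contain text, separated by single spaces.
    return " ".join(piece for piece in stripped if piece)


# Hypothetical list shaped like what .extract() might return.
sample = ["\n", "First paragraph.", "\n", "Second paragraph.", "\n"]
print(clean_fragments(sample))
```

In the spider, the result of .extract() could be passed through such a helper before it is yielded, if a single string is preferred over a list.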

Explanation

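The Reuters articles wrap their content in additional <p> tags. /text() captures only the text nodes that are direct children of the selected node, whereas //text() also captures the text inside child tags and deeper descendants.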
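The difference can be reproduced without Scrapy. The sketch below uses the standard library's xml.etree.ElementTree to mimic what ./text() and .//text() would select. The markup is a simplified assumption about how a Reuters article might be structured, not the real page source, and the helper names direct_text and descendant_text are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified markup: the article text sits inside nested <p> tags.
SNIPPET = (
    '<div class="Normal">\n'
    '<p>First paragraph.</p>\n'
    '<p>Second paragraph.</p>\n'
    '</div>'
)


def direct_text(elem):
    """Text nodes that are immediate children of elem (like ./text())."""
    # elem.text is the text before the first child; each child's tail is
    # the text that follows that child inside elem.
    parts = [elem.text or ""]
    parts += [child.tail or "" for child in elem]
    return [part for part in parts if part]


def descendant_text(elem):
    """All text nodes under elem at any depth (like .//text())."""
    return list(elem.itertext())


div = ET.fromstring(SNIPPET)
print(direct_text(div))      # only the newlines between the <p> tags
print(descendant_text(div))  # the newlines plus the paragraph text
```

With this markup, the direct text nodes are just the line breaks between the <p> tags, which matches the "bunch of escape characters" described in the question.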
