Question

I want to extract web content from each div with class="summary". Within each summary div, I want to extract the data held by each class inside that div.

Here is an excerpt of my code.

questions = Selector(response).xpath('//div[@class="summary"]')
for question in questions:
    item = StackItem()
    # get the hyperlink of h3 text
    item['title'] = question.xpath('a[@h3]/text()').extract()[0]
    yield item

How should I write the XPath expression in my code?

Answer 1

Your second XPath looks for an a element that is a direct child of div[@class="summary"] and that has an attribute named h3, which does not exist in the posted HTML.

The correct XPath to get the text of the a element inside the h3 of that div is:

h3/a/text()

Answer 2

Another way to write it could be:

questions = Selector(response).xpath('//div[@class="summary"]/h3')

and then get the data from the <a>:

item['title'] = question.xpath('a/text()').extract()[0]

This is useful if all the data you want to extract is inside the h3 tag.

Scrapy: how to extract h3 content?
