Question

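I'm using Scrapy to extract some data about concerts from websites. At least one website I'm working with uses (incorrectly, according to W3C - Is it valid to have paragraph elements inside of a heading tag in HTML5 (P inside H1)?) p elements inside h1 elements. I need to extract the text within the p element, but cannot figure out how.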

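I've read the documentation and looked at the example usage, but I'm relatively new to Scrapy. I understand the solution has something to do with setting the Selector type to "xml" rather than "html" so that any XML tree is recognized, but for the life of me I cannot figure out how or where to do that in this case.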

For example, the website has the following HTML:

<h1 class="performance-title">
<p>Bernard Haitink conducts Brahms and&nbsp;Dvořák featuring pianist     Emanuel Ax
</p>
</h1>

I have created an item called Concert() which has a field called "title". In my item loader, I use:

from scrapy.loader import ItemLoader

def parse_item(self, response):
    thisconcert = ItemLoader(item=Concert(), response=response)
    # This XPath expects the <p> to be a direct child of the <h1>
    thisconcert.add_xpath('title','//h1[@class="performance-title"]/p/text()')

    return thisconcert.load_item()

This returns a unicode list in item['title'] that does not include the text inside the p element, for example:

['\n                 ', '\n                 ', '\n                ']

I understand why, but I don't know how to get around it. I've also tried things like:

from scrapy import Selector

def parse_item(self, response):  

    s = Selector(text=' '.join(response.xpath('.//section[@id="performers"]/text()').extract()), type='xml')

What am I doing wrong here, and how can I parse HTML that has this problem (p inside h1)?

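I have referenced the information on this specific issue at Behavior of the scrapy xpath selector on h1-h6 tags, but it does not provide a complete solution that can be applied to a spider, only an example in a session using a given text string.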

Answer 1

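This was quite baffling. To be frank, I still don't understand why it happens. It turns out that the <p> tag, which should be contained within the <h1> tag, is not. A curl of the site shows the <h1><p> </p></h1> form, whereas the response obtained from the site shows it as: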

<h1 class="performance-title">\n</h1>
<p>Bernard Haitink conducts Brahms and\xa0Dvo\u0159\xe1k featuring\npianist Emanuel Ax
</p>

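As I mentioned, I have my suspicions about the cause but nothing concrete. In any case, the XPath for getting the text inside the <p> tag is therefore: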

response.xpath('//h1[@class="performance-title"]/following-sibling::p/text()').extract()

This uses <h1 class="performance-title"> as a landmark and finds its sibling <p> tag.

Answer 2

//*[@id="content"]/section/article/section[2]/h1/p/text()

Extracting p within h1 with Python / Scrapy
