Question

I have an HTML page that looks like this:

<div>
<h1>First Item</h1>
<p> the text I want </p>
</div>

<div>
<h1>Second Item</h1>
<p> the text I don't want </p>
</div>

The "First Item" heading can be at a different tag level on each scraped page, so I can't rely on a fixed index.

I'd like a selection that looks something like this (pseudocode):

from lxml import html

locate_position = locate(html.xpath(//div/h1[contains("First Item")])))

scrape = html.xpath(//div[locate_position]/p)

Answer 1

If you only want to match on the preceding sibling:

//p[preceding-sibling::h1[contains(., "First Item")]]

An option closer to your example is:

//div[contains(h1, "First Item")]/p

That is, the p children of any div that has an h1 child containing "First Item".
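For reference, here is a minimal sketch of how a parent-based and a sibling-based XPath expression could be run with lxml against the sample page from the question. The variable names (page, tree, by_parent, by_sibling) are made up for illustration, and both expressions start with // so that they match at any depth, which is an assumption about what the question needs.

```python
from lxml import html

# Sample markup copied from the question.
page = """
<div>
<h1>First Item</h1>
<p> the text I want </p>
</div>

<div>
<h1>Second Item</h1>
<p> the text I don't want </p>
</div>
"""

tree = html.fromstring(page)

# Parent-based: p children of any div whose h1 child contains the text.
by_parent = tree.xpath('//div[contains(h1, "First Item")]/p')

# Sibling-based: any p that has a preceding h1 sibling containing the text.
by_sibling = tree.xpath('//p[preceding-sibling::h1[contains(., "First Item")]]')

parent_texts = [p.text_content().strip() for p in by_parent]
sibling_texts = [p.text_content().strip() for p in by_sibling]
print(parent_texts)
print(sibling_texts)
```

Both expressions select only the paragraph under "First Item" for this sample. The sibling-based form also keeps working if the h1 and p are wrapped in something other than a div.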

Answer 2

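If you are prepared to consider using bs4 4.7.1, this is easy. You can use the :contains pseudo class to require that the h1 contains the search string, and the adjacent sibling combinator to require that the match is a p tag immediately following it.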

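The adjacent sibling combinator (+) separates two selectors and matches the second element only if it immediately follows the first element, and both are children of the same parent element.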

from bs4 import BeautifulSoup as bs

html = '''
<div>
<h1>First Item</h1>
<p> the text I want </p>
</div>

<div>
<h1>Second Item</h1>
<p> the text I don't want </p>
</div>
'''

soup = bs(html, 'lxml')

# multiple matches possible
matches = [match.text for match in soup.select('h1:contains("First Item") + p')]
print(matches)

# first match (useful if only one match expected or first required)
print(soup.select_one('h1:contains("First Item") + p').text)
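
To illustrate the "immediately follows" rule of the adjacent sibling combinator, here is a small sketch in which a span sits between the h1 and the p. The markup is a made-up variation of the question's sample, not something from the original page. It also assumes a recent Soup Sieve release, where :-soup-contains is the preferred spelling of :contains, and it uses the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical variation of the sample: a span separates the h1 from the p.
sample = """
<div>
<h1>First Item</h1>
<span>note</span>
<p> the text I want </p>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")

# Adjacent sibling (+): the p must come directly after the h1.
# Assumes a Soup Sieve version that supports :-soup-contains.
adjacent = soup.select('h1:-soup-contains("First Item") + p')

# General sibling (~): the p may come anywhere after the h1 under the same parent.
general = soup.select('h1:-soup-contains("First Item") ~ p')

adjacent_texts = [p.get_text(strip=True) for p in adjacent]
general_texts = [p.get_text(strip=True) for p in general]
print(adjacent_texts)
print(general_texts)
```

If other elements can appear between the heading and the paragraph on some pages, the general sibling combinator (~) may be the safer choice, at the cost of possibly matching more than one p.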

How to select a tag by the content of a previous tag?
