Question

I have the following XML snippet:

...snip...
<p class="p_cat_heading">DEFINITION</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">PRONUNCIATION </p>
..snip...

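I need to search the entire XML, find the heading that contains the text DEFINITION, and print the associated definition. The format of the definition is not consistent, and its attributes and elements can change, so the only reliable way to capture all of its elements and content is to read until the next element whose class is p_cat_heading.

Right now I am using the following code to find all the headings: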

for heading in root.findall(".//*[@class='p_cat_heading']"):
    if heading.text == "DEFINITION":
        # We found the correct header: take action here.
        pass

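Things I have tried:

Using lxml's getnext method. This gets the next sibling that has the "p_cat_heading" attribute, which is not what I want.
following_sibling: lxml is supposed to support this, but it throws "following-sibling is not found in prefix map".

My solution:

I have not finished it yet, but since my XML is short, I am just going to get a list of all elements, iterate until I reach the heading containing DEFINITION, and then keep iterating until the next element with the p_cat_heading class. This solution is horrible and ugly, but I cannot seem to find a clean alternative.

What I am looking for

A more Pythonic way to print the definition, which is "This, these." in our case. The solution can use XPath or some alternative. A Python-native solution is preferred, but anything will do.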

Answer 1

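You can use BeautifulSoup with CSS selectors for this task. The selector .p_cat_heading:contains("DEFINITION") ~ .p_cat_heading selects every element with class p_cat_heading that is preceded by a sibling element with class p_cat_heading containing the string "DEFINITION":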

data = '''
<p class="p_cat_heading">THIS YOU DONT WANT</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">DEFINITION</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">PRONUNCIATION </p>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

for heading in soup.select('.p_cat_heading:contains("DEFINITION") ~ .p_cat_heading'):
    print(heading.text)

Prints:

PRONUNCIATION

Further reading

CSS Selector guide

Edit:

To select the direct sibling after DEFINITION, do the following:

data = '''
<p class="p_cat_heading">THIS YOU DONT WANT</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">DEFINITION</p>
<p class="p_numberedbullet"><span class="calibre10">This is after DEFINITION</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">PRONUNCIATION </p>
<p class="p_numberedbullet"><span class="calibre10">This is after PRONUNCIATION</span>, <span class="calibre10">these</span>. </p>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

s = soup.select_one('.p_cat_heading:contains("DEFINITION") + :not(.p_cat_heading)')
print(s.text)

Prints:

This is after DEFINITION, these.
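
The + selector above returns only the first element after the heading. If a definition can span several elements, as the question suggests, everything up to the next heading has to be collected. Below is a minimal sketch of that idea using only the standard library. It assumes the fragment is wrapped in a single root element (here a made-up <body>) so that it parses as XML. The section_after helper, the p_note class and the second paragraph are hypothetical, added only for illustration.

```python
import xml.etree.ElementTree as ET
from itertools import takewhile

# Hypothetical sample: the question's snippet wrapped in a <body> root,
# plus one extra paragraph to show a multi-element definition.
SAMPLE = """<body>
<p class="p_cat_heading">DEFINITION</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_note">A second, hypothetical paragraph.</p>
<p class="p_cat_heading">PRONUNCIATION </p>
<p class="p_numberedbullet">Not part of the definition.</p>
</body>"""


def is_heading(el):
    """True for elements marked as a category heading."""
    return el.get("class") == "p_cat_heading"


def section_after(root, title):
    """Return the siblings between the heading `title` and the next heading."""
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children):
            if is_heading(child) and (child.text or "").strip() == title:
                # Take following siblings until the next heading shows up.
                return list(takewhile(lambda el: not is_heading(el), children[i + 1:]))
    return []


def text_of(el):
    """Concatenate all text inside an element, nested tags included."""
    return "".join(el.itertext()).strip()


root = ET.fromstring(SAMPLE)
for el in section_after(root, "DEFINITION"):
    print(text_of(el))
```

ElementTree elements do not know their parent, which is why the sketch walks each parent's child list instead of starting from the heading itself.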

Answer 2

There are a couple of ways to do this, but if you rely on XPath to do most of the work, this expression

//*[@class='p_cat_heading'][contains(text(),'DEFINITION')]/following-sibling::*[1]

should work.

Using lxml:

from lxml import html

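data = '''
<p class="p_cat_heading">DEFINITION</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">PRONUNCIATION </p>'''  # the snippet from the question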
exp = "//*[@class='p_cat_heading'][contains(text(),'DEFINITION')]/following-sibling::*[1]"

tree = html.fromstring(data) 
target = tree.xpath(exp)

for i in target:
    print(i.text_content())

Output:

This, these.
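
The "not found in prefix map" error from the question most likely comes from passing the axis to findall rather than to xpath. The ElementTree-style findall only understands a limited path language without axes, so it reads following-sibling::* as a namespace-prefixed tag. To my knowledge lxml's findall shares that limitation, while its xpath method implements full XPath 1.0. The sketch below reproduces the failure with the standard library alone. The try_findall helper and the shortened sample are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the question's snippet with a single root.
root = ET.fromstring(
    '<body><p class="p_cat_heading">DEFINITION</p>'
    '<p class="p_numberedbullet">This, these.</p></body>'
)


def try_findall(path):
    """Run findall and return either the matches or the error text."""
    try:
        return root.findall(path)
    except SyntaxError as exc:
        return f"SyntaxError: {exc}"


# An attribute predicate is part of the limited path language, so this works.
supported = try_findall(".//p[@class='p_cat_heading']")
print(supported)

# An axis is not, so the tokenizer treats "following-sibling" as a prefix.
unsupported = try_findall(".//p[@class='p_cat_heading']/following-sibling::*")
print(unsupported)
```

So the expression in this answer has to go through tree.xpath(...), as in the code above, not through findall.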

Most Pythonic way to find the sibling of an element in XML
