Parsing the text after a bold tag with XPath

Time: 2016-03-19 12:32:29

Tags: python xpath

I am using XPath in Python to extract text. The text is structured as follows:

<b>Field1:</b>" Value1" <br>
<b>Field2:</b>" Value2" <br><br>
<b>Field3:</b>" Value3" <br><br>
<b>Field4:</b>" Value4" <br>
<b>Field5:</b>" Value5" <br><br>

Note that the line breaks (br tags) may be inconsistent.

I want to extract:

Field 1: Value 1
Field 2: Value 2
Field 3: Value 3
Field 4: Value 4
Field 5: Value 5

Currently my XPath, //b/text(), extracts the field names but not the values.

Please help.

2 answers:

Answer 0: (score: 2)

You can solve this with the BeautifulSoup HTML parser and its .next_sibling attribute:

from bs4 import BeautifulSoup

data = """
<div>
<b>Field1:</b>" Value1" <br>
<b>Field2:</b>" Value2" <br><br>
<b>Field3:</b>" Value3" <br><br>
<b>Field4:</b>" Value4" <br>
<b>Field5:</b>" Value5" <br><br>
</div>
"""
soup = BeautifulSoup(data, 'html.parser')

for b in soup.find_all("b"):
    label = b.get_text(strip=True)
    value = b.next_sibling.strip()

    print(label, value) 
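
Since the question says the br tags may be inconsistent, a slightly more defensive variant may help. The following is only a sketch: the helper name extract_fields and the shortened sample string are made up for illustration, and it assumes the double quotes around each value are literally present in the markup, as in the sample above.

```python
from bs4 import BeautifulSoup, NavigableString


def extract_fields(html):
    """Map each <b> label to the text node that directly follows it."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for b in soup.find_all("b"):
        # "Field1:" -> "Field1"
        label = b.get_text(strip=True).rstrip(":")
        sibling = b.next_sibling
        # next_sibling can be None or another tag (e.g. <br>) when the value is missing
        value = sibling.strip() if isinstance(sibling, NavigableString) else ""
        # Assumption: values are wrapped in literal double quotes, as in the sample
        fields[label] = value.strip('"').strip()
    return fields


# Hypothetical sample: Field2 deliberately has no value
sample = '<div><b>Field1:</b>" Value1" <br><b>Field2:</b><br><br></div>'
result = extract_fields(sample)
print(result)
```

Checking the sibling type first avoids an AttributeError when a b element is followed directly by a br tag rather than by text.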

Alternatively, use lxml.html with the following-sibling axis:

from lxml.html import fromstring

data = """
<div>
<b>Field1:</b>" Value1" <br>
<b>Field2:</b>" Value2" <br><br>
<b>Field3:</b>" Value3" <br><br>
<b>Field4:</b>" Value4" <br>
<b>Field5:</b>" Value5" <br><br>
</div>
"""

root = fromstring(data)
for b in root.xpath("//b"):
    label = b.text_content()
    value = b.xpath("following-sibling::text()")[0].strip()

    print(label, value)
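
One caveat with following-sibling::text() is that it matches every later text node, so a field with no value would pick up the next field's value instead. A possible refinement, offered as a sketch rather than as part of the original answer, is to look only at the node immediately after each b element. The function name extract_pairs and the sample string are illustrative assumptions.

```python
from lxml.html import fromstring


def extract_pairs(html):
    """Return (label, value) tuples; value is "" when no text follows the <b>."""
    root = fromstring(html)
    pairs = []
    for b in root.xpath("//b"):
        label = b.text_content().strip()
        # Take the first following sibling node, and keep it only if it is a text node
        nodes = b.xpath("following-sibling::node()[1][self::text()]")
        value = nodes[0].strip() if nodes else ""
        pairs.append((label, value))
    return pairs


# Hypothetical sample: Field2 deliberately has no value
sample = (
    '<div><b>Field1:</b>" Value1" <br>'
    '<b>Field2:</b><br>'
    '<b>Field3:</b>" Value3" <br><br></div>'
)
pairs = extract_pairs(sample)
for label, value in pairs:
    print(label, value)
```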

Answer 1: (score: 2)

Assuming you are using lxml, you can use the tail attribute to get the text that follows an element.
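
A minimal sketch of the tail approach, assuming lxml.html and a shortened version of the sample markup from the question (the variable names are illustrative):

```python
from lxml.html import fromstring

# Shortened sample based on the markup in the question
data = '<div><b>Field1:</b>" Value1" <br><b>Field2:</b>" Value2" <br><br></div>'
root = fromstring(data)

rows = []
for b in root.iter("b"):
    label = b.text_content().strip()
    # tail holds the text between this element's closing tag and the next tag;
    # it is None when there is no such text, hence the fallback to ""
    value = (b.tail or "").strip()
    rows.append((label, value))
    print(label, value)
```

Because tail only covers the text up to the next tag, the number of br tags that follow does not affect the result.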