Question

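I am parsing real-world HTML files with lxml. That is, I want to extract information from the tags, and I have no control over how the markup is written. The problem is that the data contains fragments like this: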

<fieldset>
  <legend>
    <strong>Notes</strong>
  </legend>
  Slav *kǫda 'thither', kǫdě   'where, whither' < IE *k(w)om-d(h) 
</fieldset>

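Because of the < sign in the data, lxml's HTML parser skips the text and the closing tag, but that is exactly the text I want to extract. Is there any way to get the text of this tag?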

Answer 1

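The HTML is actually broken: a literal < inside text content should have been written as &lt;.

You can parse it with BeautifulSoup and the lenient html5lib parser: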

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup


data = u"""
<fieldset>
  <legend>
    <strong>Notes</strong>
  </legend>
  Slav *kǫda 'thither', kǫdě   'where, whither' < IE *k(w)om-d(h)
</fieldset>
"""

soup = BeautifulSoup(data, "html5lib")
print(soup.fieldset.legend.next_sibling.strip())

Prints:

Slav *kǫda 'thither', kǫdě   'where, whither' < IE *k(w)om-d(h)
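
If you would rather stay with lxml, one possible workaround is to escape stray < signs before parsing. The following is only a minimal sketch under stated assumptions, not part of the original answer: the helper name escape_stray_lt is hypothetical, and the regular expression is a heuristic that treats any < not followed by a letter, /, ! or ? as literal text. It will not catch cases such as a<b, where the text after < looks like a tag name.

```python
import re

# Heuristic (an assumption of this sketch): a "<" that is not followed by a
# letter, "/", "!" or "?" cannot start a tag, comment, or declaration,
# so it is treated as literal text.
_STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(markup):
    """Hypothetical helper: replace stray '<' characters with '&lt;'."""
    return _STRAY_LT.sub("&lt;", markup)


data = """
<fieldset>
  <legend>
    <strong>Notes</strong>
  </legend>
  Slav *kǫda 'thither', kǫdě   'where, whither' < IE *k(w)om-d(h)
</fieldset>
"""

cleaned = escape_stray_lt(data)

try:
    import lxml.html
except ImportError:
    # lxml is a third-party package; skip the parsing step if it is missing.
    lxml = None

if lxml is not None:
    doc = lxml.html.fromstring(cleaned)
    legend = doc.xpath("//fieldset/legend")[0]
    # The text that follows </legend> is stored as the legend element's tail.
    print((legend.tail or "").strip())
```

Pre-escaping keeps the rest of an lxml-based pipeline unchanged, at the cost of relying on a pattern that cannot cover every malformed input. The html5lib approach above is more robust, because it applies the error-recovery rules of the HTML specification.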

Python parsing HTML with lxml: get the text of a tag when specific symbols cause problems
