Question

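I have the following HTML content in a variable and need a way to read the text from it while dropping the inner tags: html=<td class="row">India (ASIA) (<a href="/asia/india">india</a> – <a href="/asia/india">photos</a>)</td>

I only want to extract the string India (ASIA) from it using BeautifulSoup. Is that possible, or should I use a regular expression?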

Answer 1

Here is one possible way with BeautifulSoup: take the text node that sits immediately before the first <a> child element:

from bs4 import BeautifulSoup

html = """<td class="row">India (ASIA) (<a href="/asia/india">india</a>&nbsp;–&nbsp;<a href="/asia/india">photos</a>)</td>"""
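# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")
# previous_sibling of the first <a> is the text node that precedes it
result = soup.find("a").previous_sibling
# The node is already a string in Python 3, so no decode() call is needed
print(result)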

Output

India (ASIA) (

Adjusting the code further to remove the trailing "(" from result should be straightforward.
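A minimal sketch of that last step, assuming the wanted text always ends with a single "(" followed by the first link. The helper name leading_text is hypothetical and not part of BeautifulSoup; it only wraps the lookup shown above and trims the result.

```python
from bs4 import BeautifulSoup

html = """<td class="row">India (ASIA) (<a href="/asia/india">india</a>&nbsp;–&nbsp;<a href="/asia/india">photos</a>)</td>"""


def leading_text(fragment):
    """Return the text before the first <a>, without the trailing "("."""
    soup = BeautifulSoup(fragment, "html.parser")
    first_link = soup.find("a")
    # No link, or nothing in front of it: fall back to all of the text
    if first_link is None or first_link.previous_sibling is None:
        return soup.get_text().strip()
    text = str(first_link.previous_sibling).rstrip()
    # Assumption: at most one opening parenthesis precedes the link
    if text.endswith("("):
        text = text[:-1]
    return text.rstrip()


print(leading_text(html))
```

This stays within BeautifulSoup and plain string methods, so no regular expression is needed for this particular layout. If the markup varies (for example, a tag instead of text before the link), the trimming logic would need to be adapted.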
