Question

I've been trying to solve this, but the only way I've managed to do it is with a convoluted while loop.

I want to take the following input:

"<td colspan='2' class='ToEx'>This is a test (<i> to see </i> this works) and I really hope it does</td>"

and output:

"This is a test (to see if this works) and I really hope it does"

Essentially, I want to remove everything that is "< >" and anything between the brackets. The best I've managed with a few commands is:

"This is a test (<i> to see </i> this works) and I really hope it does"

which still leaves me with these pesky guys: <i></i>

Here is my code:

from bs4 import BeautifulSoup

text = "<td colspan='2' class='ToEx'>This is a test (<i> to see </i> this works) and I really hope it does</td>" 
soup = BeautifulSoup(text)
content = soup.find_all("td","ToEx")
content[0].renderContents()

Answer 1

Just print the .text attribute of the tag to get its text:

print(content[0].text)

Output:

This is a test ( to see  this works) and I really hope it does
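If the doubled spaces left behind where the <i> tags used to be are a problem, they can be collapsed afterwards. A minimal sketch, assuming the built-in "html.parser" backend; the names `raw` and `cleaned` are just for illustration and are not from the question:

```python
from bs4 import BeautifulSoup

text = "<td colspan='2' class='ToEx'>This is a test (<i> to see </i> this works) and I really hope it does</td>"
soup = BeautifulSoup(text, "html.parser")
content = soup.find_all("td", "ToEx")

# .text returns the tag's text with all markup removed.
raw = content[0].text

# split() with no arguments splits on any run of whitespace, so joining
# the pieces with a single space collapses the doubled spaces.
cleaned = " ".join(raw.split())
print(cleaned)
```

This only normalizes whitespace; the space after the opening parenthesis comes from the original markup and is kept.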

Answer 2

I would use get_text() - it is designed for exactly this situation:

from bs4 import BeautifulSoup

text = "<td colspan='2' class='ToEx'>This is a test (<i> to see </i> this works) and I really hope it does</td>"
soup = BeautifulSoup(text, "html.parser")
print(soup.get_text())

This should work, as per the documentation.

I had never seen .text before; in Beautiful Soup 4 I have used .strings instead - if you want to use that:

text = "<td colspan='2' class='ToEx'>This is a test (<i> to see </i> this works) and I really hope it does</td>"
soup = BeautifulSoup(text, "html.parser")

for string in soup.strings:
    print(str(string), end="")

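Both will output:

This is a test ( to see  this works) and I really hope it does

Both work equally well, but get_text() is easier to use, especially if you want to save the text to a variable, etc.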
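As a rough sketch of the "save it to a variable" case: get_text() also accepts `separator` and `strip` arguments, which can tidy the spacing in one call. The variable names `plain` and `tidy` and the choice of "html.parser" are assumptions for this example, not part of the original answer:

```python
from bs4 import BeautifulSoup

text = "<td colspan='2' class='ToEx'>This is a test (<i> to see </i> this works) and I really hope it does</td>"
soup = BeautifulSoup(text, "html.parser")

# Keep the extracted text in a variable instead of printing it directly.
plain = soup.get_text()

# strip=True trims whitespace from each text fragment, and
# separator=" " joins the fragments with a single space.
tidy = soup.get_text(separator=" ", strip=True)

print(plain)
print(tidy)
```

The first call keeps the spacing exactly as it appears in the markup; the second is handy when the text will be stored or compared later.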

Remove all HTML tags using BeautifulSoup4 (Python 3.4)
