Remove all HTML tags using BeautifulSoup4 (Python 3.4)

Time: 2014-07-06 06:31:47

Tags: python python-3.x web-scraping beautifulsoup

I have been trying to solve this for a while, but the only way I have managed it is with a convoluted while loop.

I want to take the following input:

"<td colspan='2' class='ToEx'>This is a test (<i> to see </i> this works) and I really hope it does</td>"

and output:

"This is a test (to see if this works) and I really hope it does"

Essentially, I want to remove every "< >" pair and anything between the angle brackets. The best I have managed with a few commands is:

"This is a test (<i> to see </i> this works) and I really hope it does"

which still leaves me with these pesky things: <i></i>

Here is my code:

from bs4 import BeautifulSoup

text = "<td colspan='2' class='ToEx'>This is a test (<i> to see </i> this works) and I really hope it does</td>" 
soup = BeautifulSoup(text)
content = soup.find_all("td","ToEx")
content[0].renderContents()

2 answers:

Answer 0 (score: 2)

Just print the .text attribute of the tag; it gives you the tag's text with the markup removed:

print(content[0].text)

Output:

This is a test ( to see  this works) and I really hope it does
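
The leftover double space comes from the whitespace on either side of the closing </i> tag. As a minimal sketch, assuming single spaces are wanted in the result, the whitespace runs can be collapsed with plain str.split() and str.join(). The "html.parser" argument and the names cell, raw and tidy are illustrative choices, not part of the question's code:

```python
from bs4 import BeautifulSoup

text = "<td colspan='2' class='ToEx'>This is a test (<i> to see </i> this works) and I really hope it does</td>"

# Naming the parser explicitly is an assumption; the question lets bs4 choose one.
soup = BeautifulSoup(text, "html.parser")
cell = soup.find_all("td", "ToEx")[0]

raw = cell.text                # text of the cell with every tag dropped
tidy = " ".join(raw.split())   # collapse each run of whitespace into one space

print(repr(raw))
print(repr(tidy))
```

This only normalises spacing: the space right after the opening parenthesis is part of the source text and stays there.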

Answer 1 (score: 0)

I would use get_text(), which is designed for exactly this situation:

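from bs4 import BeautifulSoup

text = "<td colspan='2' class='ToEx'>This is a test (<i> to see </i> this works) and I really hope it does</td>"
soup = BeautifulSoup(text, "html.parser")  # name the parser explicitly
print(soup.get_text())  # all text in the document, tags removed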

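This should work, as per the documentation.

I had never seen .text before; in Beautiful Soup 4 I have used .string (and its iterable counterpart, .strings) instead. If you would rather use that: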

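from bs4 import BeautifulSoup

text = "<td colspan='2' class='ToEx'>This is a test (<i> to see </i> this works) and I really hope it does</td>"
soup = BeautifulSoup(text, "html.parser")

# .strings yields each text fragment in document order
for string in soup.strings:
    print(str(string), end="")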

Both will output:

This is a test ( to see  this works) and I really hope it does

Both work equally well, but get_text() is easier to use, especially if you want to save the text to a variable.
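
To illustrate that last point, here is a small sketch that stores the result of each approach in a variable and compares them. The variable names via_get_text and via_strings, and the explicit "html.parser" argument, are assumptions made for this example:

```python
from bs4 import BeautifulSoup

text = "<td colspan='2' class='ToEx'>This is a test (<i> to see </i> this works) and I really hope it does</td>"
soup = BeautifulSoup(text, "html.parser")

# get_text() returns the whole text as one string, ready to assign.
via_get_text = soup.get_text()

# .strings is a generator of fragments, so it has to be joined by hand.
via_strings = "".join(soup.strings)

print(via_get_text == via_strings)
```

With get_text() the string arrives in one call, whereas .strings needs the extra join step before it can be stored or processed further.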