I would like to know how to use BeautifulSoup to remove all HTML tags together with their contents.
Input:
... text <strong>ha</strong> ... text
Output:
... text ... text
Answer 0 (score: 18)
Use replace_with() (or its older alias replaceWith()):
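from bs4 import BeautifulSoup

text = "text <strong>ha</strong> ... text"
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(text, "html.parser")
for tag in soup.find_all('strong'):
    # Replace the tag (and everything inside it) with an empty string
    tag.replace_with('')
print(soup.get_text())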
Prints:
text ... text
Alternatively, as @mata suggested, you can call tag.decompose() instead of replacing the tag with an empty string. It produces the same result but is a better fit for the job.
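The decompose() variant can be sketched as follows. This is a minimal sketch, not code from the answer: the helper name strip_tags_with_content and the choice of "html.parser" are assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def strip_tags_with_content(html, tag_name):
    """Return the text of `html` with every `tag_name` element and its contents removed.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(tag_name):
        # decompose() removes the tag and everything inside it from the tree
        tag.decompose()
    return soup.get_text()


result = strip_tags_with_content("text <strong>ha</strong> ... text", "strong")
print(result)
```

The whitespace on either side of the removed tag is left untouched, so collapse repeated spaces yourself (for example with " ".join(result.split())) if that matters.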
Answer 1 (score: 0)
This works for XML. If you want to use it for HTML, change the import from BeautifulStoneSoup to BeautifulSoup.
try:
    # Using bs4
    from bs4 import BeautifulStoneSoup
    from bs4 import Tag
except ImportError:
    # Using bs3
    from BeautifulSoup import BeautifulStoneSoup
    from BeautifulSoup import Tag

def info_extract(isoup):
    '''
    Recursively walk a nested list and upon finding a non iterable, return its string
    '''
    tlist = []
    def info_extract_helper(inlist, count = 0):
        if(isinstance(inlist, list)):
            for q in inlist:
                if(isinstance(q, Tag)):
                    # Descend into the tag's children, tracking the nesting depth
                    info_extract_helper(q.contents, count + 1)
                else:
                    extracted_str = q.strip()
                    # Keep non-empty strings that sit below the top level
                    if(extracted_str and (count > 1)):
                        tlist.append(extracted_str)
    info_extract_helper([isoup])
    return tlist
xml_str = \
'''
<?xml version="1.0" encoding="UTF-8"?>
<first-tag>
  <second-tag>
    <events-data>
      <event-date someattrib="test">
        <date>20040913</date>
      </event-date>
    </events-data>
    <events-data>
      <event-date>
        <date>20040913</date>
      </event-date>
    </events-data>
  </second-tag>
</first-tag>
'''
soup = BeautifulStoneSoup(xml_str)
print(info_extract(soup))
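For comparison, bs4 can collect the same kind of list through its stripped_strings generator, without a hand-written recursive walk. This is a swapped-in technique rather than the answer's method, and a rough sketch only: it assumes bs4 with the built-in "html.parser", uses a shortened copy of the sample document without the XML declaration, and does not apply the answer's depth filter (count > 1).

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document, without the XML declaration
xml_str = """
<first-tag>
  <second-tag>
    <events-data>
      <event-date someattrib="test"><date>20040913</date></event-date>
    </events-data>
    <events-data>
      <event-date><date>20040913</date></event-date>
    </events-data>
  </second-tag>
</first-tag>
"""

soup = BeautifulSoup(xml_str, "html.parser")
# stripped_strings yields every text node with surrounding whitespace removed,
# skipping nodes that are empty after stripping
dates = list(soup.stripped_strings)
print(dates)  # ['20040913', '20040913']
```

Unlike info_extract, this also returns any text placed directly under the root element, so the two are interchangeable only when the document has no such text.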