Remove script tags inside p tags using BeautifulSoup

Time: 2014-08-09 06:34:39

Tags: python html beautifulsoup html-parsing

I wrote some code to extract the content of the paragraphs:

from bs4 import BeautifulSoup

# Parse the saved page; the parser is named explicitly to avoid a warning
with open('MUFC.html') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Print the text of every paragraph
paragraphs = soup.find_all('p')
for x in paragraphs:
    print(x.get_text())

However, some of the p tags contain script tags.

Like this:

<p>
<script>
.....
</script>
</p>
I don't want the script contents. Is there a condition we can give get_text() so that it ignores these tags?

1 answer:

Answer 0: (score: 6)

First, remove all the script tags, then get the text:

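from bs4 import BeautifulSoup

with open('MUFC.html') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Detach every <script> element from the tree so its contents are not returned as text
for script in soup.find_all('script'):
    script.extract()

# Each paragraph now contains only its own text
paragraphs = soup.find_all('p')
for paragraph in paragraphs:
    print(paragraph.get_text(strip=True))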
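
To try the same approach without the MUFC.html file, here is a minimal self-contained sketch. The names SAMPLE_HTML and paragraph_texts are made up for illustration, the markup is an invented example, and the built-in 'html.parser' is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: a script tag sits inside the first paragraph.
SAMPLE_HTML = (
    "<p>Hello <script>var x = 1;</script>world</p>"
    "<p>Second paragraph</p>"
)


def paragraph_texts(html):
    """Return the text of each <p>, with any <script> contents removed."""
    soup = BeautifulSoup(html, "html.parser")

    # Detach every <script> element from the tree before reading any text.
    for script in soup.find_all("script"):
        script.extract()

    # Join the remaining strings of each paragraph with a single space.
    return [p.get_text(" ", strip=True) for p in soup.find_all("p")]


if __name__ == "__main__":
    for text in paragraph_texts(SAMPLE_HTML):
        print(text)
```

Passing " " as the separator keeps the text on either side of the removed script apart; with get_text(strip=True) alone, the stripped pieces are joined with no space between them.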