I wrote some code to extract the text content of paragraphs:
from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on what is installed.
with open('MUFC.html') as f:
    soup = BeautifulSoup(f, 'html.parser')
a_tag = soup.find_all('p')
#print(a_tag)
for x in a_tag:
    print(x.get_text())
However, some of the p tags contain script tags, for example:
<p>
<script>
.....
</script>
</p>
I don't want that script content in the output.
Is there a condition we can pass so that get_text() ignores certain tags?
Answer 0 (score: 6)
First, remove all script elements from the tree, then get the text:
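from bs4 import BeautifulSoup

with open('MUFC.html') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Detach every <script> element so its contents cannot appear in the text.
for script in soup.find_all('script'):
    script.extract()

paragraphs = soup.find_all('p')
for paragraph in paragraphs:
    print(paragraph.get_text(strip=True))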
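
To try the same idea without the MUFC.html file, here is a minimal self-contained sketch. The SAMPLE_HTML string and the paragraph_texts helper are hypothetical names made up for illustration, and the sketch uses decompose() rather than extract(); both remove the tag from the tree, but decompose() also destroys it, which is enough when the removed tag is never needed again. It also drops style tags, which is an assumption beyond the original question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for MUFC.html.
SAMPLE_HTML = """
<html><body>
<p>First paragraph.<script>var x = 1;</script></p>
<p>Second paragraph.</p>
</body></html>
"""


def paragraph_texts(html):
    """Return the text of every <p>, with <script>/<style> content removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove the unwanted tags from the tree before reading any text.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    return [p.get_text(strip=True) for p in soup.find_all("p")]


texts = paragraph_texts(SAMPLE_HTML)
for text in texts:
    print(text)
```

Note that this modifies the parsed tree, so parse the document again if the script tags are needed later.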