Python: remove all instances of <p> and </p> from a string using BeautifulSoup

Time: 2013-12-10 02:22:49

Tags: python beautifulsoup

I am using the code below to get the following content:

<p>Ibn Umar reported: I passed by the Messenger of Allah, peace and blessings be
upon him, while my garment was trailing. The Prophet said, “<b>O Abdullah,
raise your garment</b>.” I lifted it up and he told me to raise it higher and I
did so. Some of the people said, “To where should it be raised?” The Prophet
said, “<b>In the middle of the shins</b>.”</p>

I was wondering whether you could help me get rid of the <p>, </p> and <b> tags.

Code:

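import urllib2  # Python 2 standard library
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4; for version 3 use "from BeautifulSoup import BeautifulSoup"

url1 = "http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-the-lower-garment-should-be-hallway-between-the-shins/"
content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1)
english_hadith = soup.findAll('p')[0]  # first <p> element on the page
print english_hadith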

3 answers:

Answer 0: (score: 1)

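You can do this with nltk. This applies to NLTK 2.x; in NLTK 3 clean_html is no longer implemented and raises NotImplementedError.

Example: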

from nltk import clean_html
html = "..."
clean_html(html)
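
Because clean_html is not available in NLTK 3, here is a minimal standard-library sketch of the same idea: collect only the text nodes and drop every tag. It assumes Python 3; TagStripper and strip_tags are hypothetical names chosen for illustration, not part of nltk or BeautifulSoup.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text found between tags.
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of `html` with all tags removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


sample = "<p>The Prophet said, <b>In the middle of the shins</b>.</p>"
print(strip_tags(sample))
```

This removes every tag, not only <p> and <b>, which matches what the question asks for in practice.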

Answer 1: (score: 0)

I would suggest using a regular expression rather than BeautifulSoup.

>>> import re
>>> a='<p>dhhdhd<p>dhdhd</p>'
>>> re.sub('<p>|</p>','',a)
'dhhdhddhdhd'

A more general regular expression, which also handles <p> tags with attributes, is:

re.sub('<p[^>]*>|</p>','',a)
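
Note that the pattern `<p[^>]*>` also matches tags whose names merely start with "p", such as <pre>. A possible refinement, sketched here for Python 3, adds a word boundary and covers the <b> tags from the question as well. The names TAG_RE and strip_p_and_b are hypothetical, and a regular expression remains a rough tool for HTML in general.

```python
import re

# Matches opening or closing <p> and <b> tags, with optional attributes.
# \b stops the pattern from also matching longer names such as <pre> or <body>.
TAG_RE = re.compile(r"</?(?:p|b)\b[^>]*>", re.IGNORECASE)


def strip_p_and_b(html):
    """Remove <p>, </p>, <b> and </b> tags, leaving other markup untouched."""
    return TAG_RE.sub("", html)


a = '<p class="x">dhhdhd<p>dh<b>dh</b>d</p><pre>kept</pre>'
print(strip_p_and_b(a))
```

The <pre> element in the sample is left in place, while every <p> and <b> tag is removed.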

Answer 2: (score: 0)

You're close.

print english_hadith.text

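Displays:

  Ibn Umar reported: I passed by the Messenger of Allah, peace and blessings be upon him, while my garment was trailing. The Prophet said, “O Abdullah, raise your garment.” I lifted it up and he told me to raise it higher and I did so. Some of the people said, “To where should it be raised?” The Prophet said, “In the middle of the shins.”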
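
The same approach can be tried without fetching the page. The sketch below assumes Python 3 with BeautifulSoup 4 installed (the bs4 package); the inline HTML string is a shortened stand-in for the downloaded page, and get_text() is the method form of the .text property.

```python
from bs4 import BeautifulSoup

# Stand-in for the page content; the real code would download it first.
html = "<p>The Prophet said, <b>In the middle of the shins</b>.</p><p>second</p>"

# Naming the parser explicitly avoids the "no parser specified" warning.
soup = BeautifulSoup(html, "html.parser")

first_paragraph = soup.find_all("p")[0]  # same element as soup.findAll('p')[0]
text = first_paragraph.get_text()        # text of the element and its children, tags dropped
print(text)
```

The nested <b> tag disappears along with the enclosing <p>, because get_text() joins the text of all descendants.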