Removing () from a string using Python's re library

Time: 2013-12-12 11:52:43

Tags: python beautifulsoup

How do I remove (<span class=saws></span>) from the string below?

<p>In the house of Um-Salama I saw Allah's Messenger (<span class=saws></span>) offering prayers, wrapped in a single garment 
around his body with its ends crossed round his shoulders.</b></div>

I have tried everything I can think of. I managed to remove <span class=saws></span>, but I still cannot get rid of the leftover ().

Code:

import re
from lxml import etree
from bs4 import BeautifulSoup

url = "http://www.sunnah.com/bukhari/8"

parser = etree.HTMLParser()
html   = etree.parse(url, parser)
result = etree.tostring(html.getroot(), pretty_print=True, method="html")
soup = BeautifulSoup(result) 

results = soup.findAll("div", {"class" : "actualHadithContainer"})
for result in results :

    en = re.sub('</span>|<div class="text_details">|</div>|</p>|<p>|<span class=|[??]|("saws">)','',str(result.find("div", {"class" : "text_details"})))
    en1 = re.sub('()','',str(en))
    print en1
    ar1 = re.sub('<span class="arabic_sanad arabic">|</span>','',str(result.find("span", {"class" : "arabic_sanad arabic"})))
    ar2 = re.sub('<span class="arabic_text_details arabic">|</span>|<span class="arabic_text_details arabic">','',str(result.find("span", {"class" : "arabic_text_details arabic"})))
    print ar1 + ar2

3 answers:

Answer 0: (score: 2)

Something as simple as this:

(\(<span\sclass\=saws\>.*</span>\))

This removes the entire (<span class=saws></span>), parentheses included.

See http://regex101.com/r/uL3fV4 for a live demo.
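
For reference, here is a rough sketch of how a pattern like the one above might be applied with re.sub in Python 3. The sample text is shortened from the question. The lazy .*? and the optional quotes around saws are my own adjustments, not part of the original pattern: the lazy quantifier stops at the first </span>, and the optional quotes allow for markup that has been re-serialized with quoted attributes. The names SAWS_PATTERN and strip_saws are hypothetical.

```python
import re

# Sample text shortened from the question.
text = ("<p>In the house of Um-Salama I saw Allah's Messenger "
        "(<span class=saws></span>) offering prayers</p>")

# Escaped parentheses match literal "(" and ")".
# Assumptions: lazy ".*?" and optional quotes around the class value.
SAWS_PATTERN = re.compile(r'\(<span\sclass="?saws"?>.*?</span>\)')

def strip_saws(html):
    """Remove every '(<span class=saws>...</span>)' occurrence from html."""
    return SAWS_PATTERN.sub("", html)

print(strip_saws(text))
```

Note that the space before the removed parenthesis stays in place, so the result contains two consecutive spaces at that position.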

Answer 1: (score: 0)

My example with BeautifulSoup:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(u"""<p>In the house of Um-Salama I saw Allah's 
Messenger (<span class=saws></span>) offering prayers, wrapped in a single garment 
around his body with its ends crossed round his shoulders.</b></div>""")
results = soup.findAll()
for tag in results:
    if tag.name == 'span' and 'saws' in tag.attrs.get('class', []):
        tag.extract()

print re.sub(ur'\(\)', u'', unicode(soup))
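
The last line is the step that matters for the original problem. In a regular expression, an unescaped () is an empty group that matches the empty string, so re.sub('()', '', ...) removes nothing. Below is a minimal Python 3 sketch of the difference, using a shortened sample string of my own rather than the real page content.

```python
import re

# Shortened sample: the span has already been removed, "()" is left over.
text = "I saw Allah's Messenger () offering prayers"

# '()' is an empty capturing group: it matches the empty string at every
# position, so replacing those matches with '' leaves the text unchanged.
unchanged = re.sub('()', '', text)

# Escaping the parentheses makes the pattern match the literal characters.
cleaned = re.sub(r'\(\)', '', text)

# For a fixed literal, str.replace avoids regex escaping altogether.
also_cleaned = text.replace('()', '')

print(cleaned)
```

For a fixed string like this, str.replace is probably the simpler choice. The regex form is only needed if the match has to vary.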

Answer 2: (score: 0)

#! /usr/bin/env python
from bs4 import BeautifulSoup
import urllib2
import lxml
from lxml import etree
import re

url = "http://www.sunnah.com/bukhari/8"

parser = etree.HTMLParser()
html   = etree.parse(url, parser)
result = etree.tostring(html.getroot(), pretty_print=True, method="html")
# content1 = urllib2.urlopen(url).read()
soup = BeautifulSoup(result) 

results = soup.findAll("div", {"class" : "actualHadithContainer"})
for result in results :

    en = re.sub('</span>|<div class="text_details">|</div>|</p>|<p>|[??]|\(<span class="saws"></span>\)|<b>|</b>','',str(result.find("div", {"class" : "text_details"})))
    print en

    ar1 = re.sub('<span class="arabic_sanad arabic">|</span>','',str(result.find("span", {"class" : "arabic_sanad arabic"})))
    ar2 = re.sub('<span class="arabic_text_details arabic">|</span>|<span class="arabic_text_details arabic">','',str(result.find("span", {"class" : "arabic_text_details arabic"})))
    print ar1 + ar2