Removing substrings with Python

Time: 2012-01-02 16:18:44

Tags: python regex string

I have extracted some information from a forum. This is the raw string I have now:

string = 'i think mabe 124 + <font color="black"><font face="Times New Roman">but I don\'t have a big experience it just how I see it in my eyes <font color="green"><font face="Arial">fun stuff'

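What I don't want are the substrings "<font color="black"><font face="Times New Roman">" and "<font color="green"><font face="Arial">". I want to keep every other part of the string and drop only these. So the result should look like this: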

resultString = "i think mabe 124 + but I don't have a big experience it just how I see it in my eyes fun stuff"

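How can I do this? I actually used Beautiful Soup to extract the string above from the forum. Now I would prefer to use a regular expression to remove these parts.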

3 answers:

Answer 0: (score: 90)

import re
re.sub('<.*?>', '', string)
"i think mabe 124 + but I don't have a big experience it just how I see it in my eyes fun stuff"

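The re.sub function takes a regular expression and replaces every match in the string with the second argument. In this case, we search for all tags ('<.*?>') and replace them with nothing ('').

The ? after .* makes the match non-greedy, so each match stops at the first '>' instead of the last one.

See the documentation of the re module for more information.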
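To see why the non-greedy ? matters, here is a minimal sketch comparing the two patterns. The sample string s is made up for illustration and is not from the question.

```python
import re

# Hypothetical sample input (not the string from the question).
s = 'one <b>two</b> three'

# Greedy: .* runs on to the LAST '>', so the text between the tags is lost too.
greedy = re.sub(r'<.*>', '', s)

# Non-greedy: .*? stops at the FIRST '>', so each tag is removed on its own.
lazy = re.sub(r'<.*?>', '', s)

print(repr(greedy))  # 'one  three'
print(repr(lazy))    # 'one two three'
```

With the greedy pattern, everything from the first '<' to the last '>' counts as a single match, which is why "two" disappears as well.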

Answer 1: (score: 14)

>>> import re
>>> st = " i think mabe 124 + <font color=\"black\"><font face=\"Times New Roman\">but I don't have a big experience it just how I see it in my eyes <font color=\"green\"><font face=\"Arial\">fun stuff"
>>> re.sub("<.*?>","",st)
" i think mabe 124 + but I don't have a big experience it just how I see it in my eyes fun stuff"
>>> 

Answer 2: (score: -3)

from bs4 import BeautifulSoup
BeautifulSoup(text, features="html.parser").text

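Apologies to anyone who was looking for more detail in my answer.

Let me explain.

BeautifulSoup is a widely used Python package that helps developers work with HTML in Python.

The code above (BeautifulSoup(text, features="html.parser").text) takes the HTML text and converts it into a BeautifulSoup object, which means it parses everything (every HTML tag in the given text).

Once that is done, we simply request all the text from the parsed HTML object.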
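The same parse-then-collect-text idea can be sketched with the standard library alone. This is not BeautifulSoup itself: it swaps in html.parser.HTMLParser, the stdlib parser that features="html.parser" refers to. The names TextExtractor and strip_tags are made up for this example, and the input is a shortened version of the question's string.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; tags themselves never reach here.
        self.parts.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered after the last tag
    return ''.join(parser.parts)


raw = ('i think mabe 124 + <font color="black"><font face="Times New Roman">'
       'but I don\'t have a big experience <font color="green">'
       '<font face="Arial">fun stuff')
clean = strip_tags(raw)
print(clean)
```

Unlike the '<.*?>' regex, a parser knows what a tag is, so it should cope better with input where '<' or '>' also appear in ordinary text.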