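So I'm trying to remove the HTML from
<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> the music that is created from transformers having s$#
so that after parsing it reads
dubstep - the music that is created from transformers having s$#
I'd like to extract the text dubstep from this HTML hyperlink.
How can I do that?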
I read the solution here: How to remove tags from a string in python using regular expressions? (NOT in HTML)
but I get
<class 'NameError'>, NameError("name 're' is not defined",), <traceback object at 0x036A41E8>)
Answer 0 (score: 0)
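Well,
NameError("name 're' is not defined",),
means that you forgot to import re at the top of your script, but that's a guess.
Also, since you only need the word between the <a></a> tags, you need a regular expression similar to this:
.*<a .*>([^<]*)</a>.*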
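As a minimal sketch of the regex approach, assuming only Python's standard re module: the names html, link_text_pattern, link_text and plain_text are illustrative, and the pattern is a slightly tightened variant of the one above, not necessarily what the linked solution uses. Regular expressions are fine for a small, predictable snippet like this one, but they are fragile on arbitrary HTML.

```python
import re  # without this import, using re raises the NameError from the question

html = ('<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> '
        'the music that is created from transformers having s$#')

# Capture whatever sits between the opening <a ...> tag and its closing </a>.
link_text_pattern = re.compile(r'<a\s[^>]*>([^<]*)</a>')

match = link_text_pattern.search(html)
link_text = match.group(1) if match else None
print(link_text)

# Alternatively, drop every tag and keep all of the remaining text.
plain_text = re.sub(r'<[^>]+>', '', html)
print(plain_text)
```

The first print shows only the link text; the second shows the whole sentence with the tags removed.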
Answer 1 (score: 0)
Why not use BeautifulSoup?
In [44]: from bs4 import BeautifulSoup
In [45]: soup = BeautifulSoup ('''<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> the music that is created from transformers having s$#''')
In [46]: soup.find('a').text
Out[46]: u'dubstep'
Edit:
Or, if you just want all of the text:
In [48]: soup.text
Out[48]: u'dubstep the music that is created from transformers having s$#'
Answer 2 (score: 0)
Use this:
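from bs4 import BeautifulSoup  # note the capital S in BeautifulSoup
# The markup has to be a quoted string, otherwise this line is a syntax error.
html = '<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> the music that is created from transformers having s$#'
# Naming the parser explicitly avoids bs4's "no parser was explicitly specified" warning.
soup = BeautifulSoup(html, 'html.parser')
print(soup.get_text())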