Python: strip HTML to get hyperlink text

Date: 2014-05-24 20:11:42

Tags: python html strip

I'm trying to strip the HTML from
<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> the music that is created from transformers having s$#

so that after parsing it reads like this:

dubstep - the music that is created from transformers having s$#

I want to extract the text dubstep from this HTML hyperlink.

How do I do that?

I read the solution here: How to remove tags from a string in python using regular expressions? (NOT in HTML)

But I get

<class 'NameError'>, NameError("name 're' is not defined",), <traceback object at 0x036A41E8>)

3 Answers:

Answer 0 (score: 0)

Well,

 NameError("name 're' is not defined",),

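means you forgot import re at the top of your script, but that is a guess.

Also, since you only need the word between the <a></a> tags, you need a regular expression similar to this: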

 .*<a .*>([^<]*)</a>.*
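As a rough sketch of how that pattern could be used, here is one way to wire it up in Python. Note that import re comes first, which is what the NameError is complaining about. The pattern below is a slightly tightened variant of the one above (it uses [^>]* instead of a greedy .* so it cannot run past the end of the opening tag), and extract_link_text is a hypothetical helper name, not something from the question. A regular expression like this is fragile: it assumes the link contains only plain text and no nested tags.

```python
import re

# Tightened variant of the pattern above: [^>]* stops at the end of the
# opening <a ...> tag, and the group captures the text before </a>.
LINK_TEXT_RE = re.compile(r"<a\s[^>]*>([^<]*)</a>")


def extract_link_text(html):
    """Return the text of the first <a ...>text</a> link, or None if absent."""
    match = LINK_TEXT_RE.search(html)
    return match.group(1) if match else None


html = ('<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> '
        'the music that is created from transformers having s$#')
print(extract_link_text(html))  # dubstep
```

If the input may contain several links, or links with nested markup, a real HTML parser is the safer choice.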

Answer 1 (score: 0)

Why not use BeautifulSoup?

In [44]: from bs4 import  BeautifulSoup

In [45]: soup = BeautifulSoup ('''<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> the music that is created from transformers having s$#''')

In [46]: soup.find('a').text
Out[46]: u'dubstep'

Edit:

Or if you only want the text:

In [48]: soup.text 
Out[48]: u'dubstep the music that is created from transformers having s$#'
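If installing a third-party package is not an option, roughly the same result can be had with the standard library's html.parser module. The following is only a sketch under that assumption: LinkTextParser and extract_link_texts are hypothetical names introduced here, and the class only collects text that appears inside <a> elements.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collects the text found inside <a>...</a> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0        # greater than 0 while inside an <a> element
        self._buffer = []      # text pieces of the link being read
        self.link_texts = []   # finished link texts, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.link_texts.append("".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        # Keep text only while we are inside a link.
        if self._depth > 0:
            self._buffer.append(data)


def extract_link_texts(html):
    """Return a list with the text of every link in the given HTML string."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.link_texts


html = ('<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> '
        'the music that is created from transformers having s$#')
print(extract_link_texts(html))  # ['dubstep']
```

Unlike the regular expression approach, this also handles links that wrap other tags, for example <a href="x"><b>bold</b> text</a>.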

Answer 2 (score: 0)

Use this:

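from bs4 import BeautifulSoup

# The markup has to be a quoted string
html = '<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> the music that is created from transformers having s$#'
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')
# get_text() returns all text in the document with the tags removed
print(soup.get_text())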