Question

I am trying to strip the HTML from this string:

<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> the music that is created from transformers having s$#

so that after parsing it reads like this:

dubstep - the music that is created from transformers having s$#

I want to extract the text dubstep from this HTML hyperlink.

How do I do that?

I read the solution here: How to remove tags from a string in python using regular expressions? (NOT in HTML)

But I get

<class 'NameError'>, NameError("name 're' is not defined",), <traceback object at 0x036A41E8>)

Answer 1

Well,

 NameError("name 're' is not defined",),

means that you forgot import re at the top of your script, though that is only a guess.

Also, since you only need the word between the <a></a> tags, you need a regular expression similar to this:

 .*<a .*>([^<]*)</a>.*
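A minimal sketch of how that idea could look in Python, assuming the string from the question. The pattern here is a slightly tightened variant of the one above (it uses [^>]* instead of the greedy .*), and the variable names link_text and plain_text are made up for illustration:

```python
import re  # leaving this import out is what raises NameError: name 're' is not defined

html = ('<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> '
        'the music that is created from transformers having s$#')

# Capture whatever sits between the opening <a ...> tag and the closing </a>.
match = re.search(r'<a\b[^>]*>([^<]*)</a>', html)
link_text = match.group(1) if match else None
print(link_text)

# Alternatively, drop every tag and keep all of the text.
plain_text = re.sub(r'<[^>]+>', '', html)
print(plain_text)
```

This works for a simple, well-formed snippet like the one in the question; regular expressions tend to break on nested or malformed HTML, which is why the answers below suggest a parser.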

Answer 2

Why not use BeautifulSoup?

In [44]: from bs4 import BeautifulSoup

In [45]: soup = BeautifulSoup('''<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> the music that is created from transformers having s$#''', 'html.parser')

In [46]: soup.find('a').text
Out[46]: u'dubstep'

Edit:

Or if you just want the text:

In [48]: soup.text 
Out[48]: u'dubstep the music that is created from transformers having s$#'

Answer 3

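Use this:

from bs4 import BeautifulSoup

# The markup must be a quoted string; the class name is BeautifulSoup (capital S).
html = '<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> the music that is created from transformers having s$#'
soup = BeautifulSoup(html, 'html.parser')
print(soup.get_text())  # prints all of the text with the tags stripped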
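If installing bs4 is not an option, roughly the same result can be sketched with the standard library's html.parser module. This is an alternative to the answers above rather than anything the thread itself proposes, and the class name LinkTextParser and its attributes are hypothetical names chosen for this example:

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collects the text found inside <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.link_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Start a new entry for this link's text.
            self.in_link = True
            self.link_texts.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        # Only keep text that appears while we are inside an <a> element.
        if self.in_link:
            self.link_texts[-1] += data


html = ('<a href="/define.php?term=dubstep&defid=5175360">dubstep</a> '
        'the music that is created from transformers having s$#')
parser = LinkTextParser()
parser.feed(html)
parser.close()
print(parser.link_texts)
```

The parser keeps one list entry per link, so a page with several hyperlinks would yield several strings rather than just the first one.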

Python: strip HTML to get the hyperlink text

3 answers: