Why doesn't this regular expression work?

Date: 2014-05-17 12:06:38

Tags: python regex

The source of my web page looks like this:

<span class="l r positive-icon">
Turkish
</span>
<span>
The.Mist[2007]DvDrip[Eng]-aXXo
</span>
<span class="l r neutral-icon">
Vietnamese
</span>
<span>
The.Mist.2007.720p.Bluray.x264.YIFY 
</span>

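As you can see, there are two kinds of span with a class: "l r positive-icon" and "l r neutral-icon". I only want the languages, that is, everything inside the spans that have any class at all. I am using this regular expression, but it gives me an empty list: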

allLanguages = re.findall('<span class=".*">\s(.*)\s</span>', allLanguagesTags)

allLanguagesTags contains the source shown above. Can anyone give me a hint?

1 answer:

Answer 0 (score: 3)

Don't use a regular expression for this. Use an actual HTML parser. I recommend BeautifulSoup instead:

from bs4 import BeautifulSoup

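# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(yourhtml, 'html.parser')
# class_=True keeps only the <span> elements that have a class attribute
languages = [s.get_text().strip() for s in soup.find_all('span', class_=True)]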
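
As a rough illustration of why the regex is fragile: against the snippet exactly as posted, the pattern does find both languages, so the real page probably differs in its whitespace. That is a guess, since the actual page is not shown. The `indented` and `crlf` strings below are hypothetical variants made up for this sketch. Because `\s` matches exactly one whitespace character and `.` never matches a newline, either variant is enough to produce an empty list:

```python
import re

# The pattern from the question, unchanged.
PATTERN = r'<span class=".*">\s(.*)\s</span>'

# The snippet exactly as posted: one "\n" on each side of the text.
as_posted = '''<span class="l r positive-icon">
Turkish
</span>
<span>
The.Mist[2007]DvDrip[Eng]-aXXo
</span>
<span class="l r neutral-icon">
Vietnamese
</span>
'''

# Hypothetical variant: same markup, but indented as real pages often are.
indented = '''<span class="l r positive-icon">
    Turkish
  </span>
'''

# Hypothetical variant: same markup with Windows line endings.
crlf = as_posted.replace('\n', '\r\n')

matches_as_posted = re.findall(PATTERN, as_posted)
matches_indented = re.findall(PATTERN, indented)
matches_crlf = re.findall(PATTERN, crlf)

print(matches_as_posted)  # ['Turkish', 'Vietnamese']
print(matches_indented)   # []
print(matches_crlf)       # []
```

A parser does not care about any of this, which is the main reason to prefer it here.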

Demo with the sample HTML from the question:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <span class="l r positive-icon">
... Turkish
... </span>
... <span>
... The.Mist[2007]DvDrip[Eng]-aXXo
... </span>
... <span class="l r neutral-icon">
... Vietnamese
... </span>
... <span>
... The.Mist.2007.720p.Bluray.x264.YIFY 
... </span>
... ''')
>>> [s.get_text().strip() for s in soup.find_all('span', class_=True)]
[u'Turkish', u'Vietnamese']
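
If installing BeautifulSoup is not an option, roughly the same result can be had from the standard library's `html.parser`. This is only a minimal Python 3 sketch, not a replacement for BeautifulSoup: the class name `ClassSpanText` and the helper `extract_languages` are made up for this example, and it assumes the spans are not nested inside each other.

```python
from html.parser import HTMLParser


class ClassSpanText(HTMLParser):
    """Collect the text of every <span> that has a class attribute.

    Assumes spans are not nested; a nested <span> would end the capture early.
    """

    def __init__(self):
        super().__init__()
        self.languages = []
        self._inside = False   # True while inside a <span class="...">
        self._buffer = []      # text chunks seen inside the current span

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'span' and 'class' in dict(attrs):
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'span' and self._inside:
            self.languages.append(''.join(self._buffer).strip())
            self._inside = False


def extract_languages(html):
    """Return the stripped text of each classed <span> in html."""
    parser = ClassSpanText()
    parser.feed(html)
    parser.close()
    return parser.languages


sample = '''<span class="l r positive-icon">
Turkish
</span>
<span>
The.Mist[2007]DvDrip[Eng]-aXXo
</span>
<span class="l r neutral-icon">
Vietnamese
</span>
<span>
The.Mist.2007.720p.Bluray.x264.YIFY
</span>
'''

print(extract_languages(sample))  # ['Turkish', 'Vietnamese']
```

The BeautifulSoup one-liner above is still the easier choice when the library is available, since it handles nesting and malformed markup for you.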