Question

I have just started using Python, and I am trying to match the top 250 movies on IMDB with this faulty code:

import urllib2
import re

def main():
    response = urllib2.urlopen('http://www.imdb.com/chart/top')
    html = response.read()
    entries = re.findall("/title/.*</font>", html) #Wrong regex
    print entries

if __name__ == "__main__":
    main()

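My reasoning was that I want to match everything between /title/ and </font>, hence the .* between the two. That is clearly not the right approach, because it matches the whole list as one string instead of each individual entry. The regular expression tutorials I have read online left me confused. Any help?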

Answer 1

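Trying to parse HTML with regular expressions is bad practice; this is exactly the kind of job HTML parsers are built for. Python has many options, such as Beautiful Soup, lxml, and others.

I will show how to use lxml and XPath expressions to get all 250 titles: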

import lxml
from lxml import etree
import urllib2

response = urllib2.urlopen('http://www.imdb.com/chart/top')
html = response.read()
imdb = etree.HTML(html)
titles = imdb.xpath('//div[@id="main"]/table//tr//a/text()')

If you run print titles[0], you get 'The Shawshank Redemption' as output. To work out the XPath, use the Firebug extension for Firefox, or install FirePath.

Answer 2

Try this:

import re
import urllib2

def main():
    response = urllib2.urlopen('http://www.imdb.com/chart/top')
    html = response.read()
    # Group 1 captures the IMDb id, group 2 the link text (the title).
    entries = re.findall("<a.*?/title/(.*?)/\">(.*?)</a>", html)
    return entries

It uses groups for the IMDb id and the title. entries will be a list of tuples.
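
The pattern in the question matched the whole list because .* is greedy: it runs on to the last </font> in the page. The .*? form is non-greedy and stops at the first possible match. Here is a minimal sketch of the difference that needs no network access. The sample string is a made-up fragment, not IMDb's actual markup, and the code is written to run on Python 3:

```python
import re

# Hypothetical fragment standing in for the chart page; the real markup may differ.
sample = (
    '<a href="/title/tt0111161/">The Shawshank Redemption</a></font>'
    '<a href="/title/tt0068646/">The Godfather</a></font>'
)

# Greedy: .* extends to the LAST </font>, so both entries end up in one match.
greedy = re.findall(r"/title/.*</font>", sample)

# Non-greedy with two groups: findall returns one (id, title) tuple per link.
entries = re.findall(r'<a.*?/title/(.*?)/">(.*?)</a>', sample)

print(len(greedy))
print(entries)
```

On this fragment the greedy pattern yields a single match spanning both entries, while the grouped non-greedy pattern yields one tuple per movie.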

Answer 3

You should not use regular expressions to parse HTML. Use a dedicated HTML parser instead. Take a look at: RegEx match open tags except XHTML self-contained tags
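
As one possible illustration of the parser approach, here is a rough sketch using html.parser from the Python 3 standard library (in Python 2 the module is named HTMLParser). The class name TitleLinkParser and the sample markup are hypothetical, chosen for this example only, and the rule "collect the text of links whose href contains /title/" is an assumption about how the page is structured:

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect the text of every <a> whose href contains /title/."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title_link = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        href = dict(attrs).get("href") or ""
        if tag == "a" and "/title/" in href:
            self._in_title_link = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_title_link = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current title.
        if self._in_title_link:
            self.titles[-1] += data


# Made-up fragment; the real chart page is assumed to look roughly like this.
sample = (
    '<table><tr><td><a href="/title/tt0111161/">The Shawshank Redemption</a></td></tr>'
    '<tr><td><a href="/title/tt0068646/">The Godfather</a></td></tr></table>'
)

parser = TitleLinkParser()
parser.feed(sample)
parser.close()
print(parser.titles)
```

To run it against the live page, the downloaded HTML would be decoded to text and passed to feed() in place of sample.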

Answer 4

Using lxml and XPath:

This can be done very simply:

import lxml.html

doc = lxml.html.parse('http://www.imdb.com/chart/top')
titles  = doc.xpath('//div[@id="main"]/table//a/text()')

print u'\n'.join(titles)

Python regular expressions with the IMDB top 250 list
