Parsing a large chunk of HTML with re.findall()

Asked: 2014-09-19 17:51:08

Tags: python regex urllib2

I have a project where I'm using the Rotten Tomatoes API to collect the movies currently playing in theaters. It then collects all of the images on each movie's IMDb page. The problem I'm having is with collecting the images. The goal is for this code to run in under 8 seconds, but the regex step takes forever! Currently I'm using this regex:

re.findall('<img.*?>', str(line))

where line is a large chunk of HTML.

Does anyone have a better (perhaps more refined?) regex in mind? All comments are welcome!

Full code is attached below.

import json, re, pprint, time
from urllib2 import urlopen

def get_image(url):

    total  = 0
    page   = urlopen(url).readlines()

    for line in page:

        hit   = re.findall('<img.*?>', str(line))
        total += len(hit)
    # print('{0} Images total: {1}'.format(url, total))
    return total


if __name__ == "__main__":
    start = time.time()
    json_list = list()
    url = "http://api.rottentomatoes.com/api/public/v1.0/lists/movies/in_theaters.json?apikey=<apikey>"
    response = urlopen(url)
    data = json.loads(response.read())

    for i in data["movies"]:
        json_dict = dict()
        json_dict["Title"] = str(i['title'])
        json_dict["url"] = str("http://www.imdb.com/title/tt" + i['alternate_ids']['imdb'])
        json_dict["imdb_id"] = str(i['alternate_ids']['imdb'])
        json_dict["count"] = get_image(str(json_dict["url"]) )
        json_list.append(json_dict)
    end = time.time()
    pprint.pprint(json_list)
    runtime =  end - start
    print "Program runtime: " + str(runtime)

3 Answers:

Answer 0 (score: 1)

You can't parse HTML with regular expressions. If you are limited to the Python 2 standard library, use HTMLParser:

from HTMLParser import HTMLParser
class ImgFinder(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            print 'found img tag, src=', dict(attrs)['src']

parser = ImgFinder()
parser.feed(... HTML source ...)
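
To adapt that to the question's goal of counting images per page, a minimal sketch along the same lines (the ImgCounter class and get_image wrapper are only illustrative names) could look like this:

from HTMLParser import HTMLParser
from urllib2 import urlopen

class ImgCounter(HTMLParser):
    # Counts <img> start tags instead of printing their src attributes.
    def __init__(self):
        HTMLParser.__init__(self)
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            self.count += 1

def get_image(url):
    # Drop-in replacement for the regex-based get_image() in the question.
    counter = ImgCounter()
    counter.feed(urlopen(url).read())
    return counter.count

Note that Python 2's HTMLParser can raise HTMLParseError on badly formed markup, so feed() may need a try/except when run against real pages.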

Answer 1 (score: 1)

While you should certainly heed the general wisdom that using regular expressions to parse HTML is a bad idea (you really should use an HTML parser), there is one point worth noting about the efficiency of your regex.

Compare these two:

>>> timeit('import re; re.findall("<img.*?>", \'blah blah blah <img src="http://www.example.org/test.jpg"> blah blah blah <img src="http://wwww.example.org/test2.jpg"> blah blah blah\')')
3.366645097732544
>>> timeit('import re; re.findall("<img[^>]*>", \'blah blah blah <img src="http://www.example.org/test.jpg"> blah blah blah <img src="http://wwww.example.org/test2.jpg"> blah blah blah\')')
2.328295946121216

You can see that the latter, equivalent regex is noticeably faster. That's because it doesn't need to backtrack. See this excellent blog post, http://blog.stevenlevithan.com/archives/greedy-lazy-performance, for an explanation of why.
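
Along the same lines, a further tweak worth trying (just a sketch, untimed) is to compile the pattern once and run it over the whole page in a single pass, rather than line by line as the question's get_image() does:

import re
from urllib2 import urlopen

# Compiled once and reused for every page.
IMG_RE = re.compile(r'<img[^>]*>')

def count_imgs(url):
    # A single findall over the full document avoids re-scanning per line.
    html = urlopen(url).read()
    return len(IMG_RE.findall(html))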

Answer 2 (score: 0)

While I know that using a regex to search for img tags in HTML isn't ideal, this is close to what I ended up going with. With threading I was able to get the runtime down to 2-12 seconds, depending on your connection:

#No shebang line, please run in Linux shell % python img_count.py

#Python libs
import threading, urllib2, re
import Queue, json, time, pprint

#Global lists 
JSON_LIST = list()
URLS = list()

def get_movies():
    url = "http://api.rottentomatoes.com/api/public/v1.0/lists/movies/in_theaters.json?apikey=    <apikey>"
    response = urllib2.urlopen(url)
    data = json.loads(response.read())    
    return data


def get_imgs(html):
    total = 0
    # This next line is not ideal. Would much rather use a lib such as Beautiful Soup for this
    total += len(re.findall(r"<img[^>]*>", html)) 
    return total


def read_url(url, queue):
    # Tag each result with its URL so it can be matched back to its movie,
    # since results arrive in completion order rather than request order.
    data = urllib2.urlopen(url).read()
    queue.put((url, data))


def fetch_urls():
    result = Queue.Queue()
    threads = [threading.Thread(target=read_url, args = (url,result)) for url in URLS]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return result


if __name__ == "__main__":
    start = time.time()
    movies = get_movies()
    for movie in movies["movies"]:
        url = "http://www.imdb.com/title/tt" + movie['alternate_ids']['imdb']
        URLS.append(url)    
    queue = fetch_urls()
    # Drain the queue into a URL -> HTML map; the threads finish in arbitrary
    # order, so results cannot be paired with movies by position alone.
    pages = dict()
    while not queue.empty():
        url, html = queue.get()
        pages[url] = html
    for movie in movies["movies"]:
        url = "http://www.imdb.com/title/tt" + movie['alternate_ids']['imdb']
        json_dict = {
                "title": movie['title'],
                "url": url,
                "imdb_id": movie['alternate_ids']['imdb'],
                "count": get_imgs(pages[url])
                }
        JSON_LIST.append(json_dict)
    pprint.pprint(JSON_LIST)
    end = time.time()
    print "\n"
    print "Elapsed Time (seconds):", end - start