Parsing a large chunk of HTML with re.findall()

Asked: 2014-09-19 17:51:08

Tags: python regex urllib2

I have a project where I'm using the Rotten Tomatoes API to collect the movies currently playing in theaters. It then collects all of the images on each movie's IMDb page. The problem I'm having is with collecting the images. The goal is for this code to run in under 8 seconds, but the regex step takes forever! Currently I'm using this regex:

re.findall('<img.*?>', str(line))

where line is a large chunk of HTML.

Does anyone have a better (perhaps more refined?) regex in mind? All comments are welcome!

Full code is attached below.

import json, re, pprint, time
from urllib2 import urlopen

def get_image(url):

    total  = 0
    page   = urlopen(url).readlines()

    for line in page:

        hit   = re.findall('<img.*?>', str(line))
        total += len(hit)
    # print('{0} Images total: {1}'.format(url, total))
    return total


if __name__ == "__main__":
    start = time.time()
    json_list = list()
    url = "http://api.rottentomatoes.com/api/public/v1.0/lists/movies/in_theaters.json?apikey=<apikey>"
    response = urlopen(url)
    data = json.loads(response.read())

    for i in data["movies"]:
        json_dict = dict()
        json_dict["Title"] = str(i['title'])
        json_dict["url"] = str("http://www.imdb.com/title/tt" + i['alternate_ids']['imdb'])
        json_dict["imdb_id"] = str(i['alternate_ids']['imdb'])
        json_dict["count"] = get_image(str(json_dict["url"]) )
        json_list.append(json_dict)
    end = time.time()
    pprint.pprint(json_list)
    runtime =  end - start
    print "Program runtime: " + str(runtime)

3 Answers:

Answer 0 (score: 1)

You can't parse HTML with regular expressions. If you are limited to the Python 2 standard library, use HTMLParser:

from HTMLParser import HTMLParser
class ImgFinder(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            print 'found img tag, src=', dict(attrs)['src']

parser = ImgFinder()
parser.feed(... HTML source ...)
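
To adapt that to the question's goal of counting images per page, a minimal sketch along the same lines (the ImgCounter class and get_image wrapper are only illustrative names) could look like this:

from HTMLParser import HTMLParser
from urllib2 import urlopen

class ImgCounter(HTMLParser):
    # Counts <img> start tags instead of printing their src attributes.
    def __init__(self):
        HTMLParser.__init__(self)
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            self.count += 1

def get_image(url):
    # Drop-in replacement for the regex-based get_image() in the question.
    counter = ImgCounter()
    counter.feed(urlopen(url).read())
    return counter.count

Note that Python 2's HTMLParser can raise HTMLParseError on badly formed markup, so feed() may need a try/except when run against real pages.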

Answer 1 (score: 1)

While you should certainly heed the general wisdom that using regular expressions to parse HTML is a bad idea (you really should use an HTML parser), there is one point worth noting about the efficiency of your regex.

Compare these two:

>>> timeit('import re; re.findall("<img.*?>", \'blah blah blah <img src="http://www.example.org/test.jpg"> blah blah blah <img src="http://wwww.example.org/test2.jpg"> blah blah blah\')')
3.366645097732544
>>> timeit('import re; re.findall("<img[^>]*>", \'blah blah blah <img src="http://www.example.org/test.jpg"> blah blah blah <img src="http://wwww.example.org/test2.jpg"> blah blah blah\')')
2.328295946121216

You can see that the latter, equivalent regex is noticeably faster. That's because it doesn't need to backtrack. See this excellent blog post, http://blog.stevenlevithan.com/archives/greedy-lazy-performance, for an explanation of why.
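
Along the same lines, a further tweak worth trying (just a sketch, untimed) is to compile the pattern once and run it over the whole page in a single pass, rather than line by line as the question's get_image() does:

import re
from urllib2 import urlopen

# Compiled once and reused for every page.
IMG_RE = re.compile(r'<img[^>]*>')

def count_imgs(url):
    # A single findall over the full document avoids re-scanning per line.
    html = urlopen(url).read()
    return len(IMG_RE.findall(html))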

Answer 2 (score: 0)

While I know that using a regex to search for img tags in HTML isn't ideal, this is close to what I ended up going with. With threading I was able to get the runtime down to 2-12 seconds, depending on your connection:

#No shebang line, please run in Linux shell % python img_count.py

#Python libs
import threading, urllib2, re
import Queue, json, time, pprint

#Global lists 
JSON_LIST = list()
URLS = list()

def get_movies():
    url = "http://api.rottentomatoes.com/api/public/v1.0/lists/movies/in_theaters.json?apikey=    <apikey>"
    response = urllib2.urlopen(url)
    data = json.loads(response.read())    
    return data


def get_imgs(html):
    total = 0
    # This next line is not ideal. Would much rather use a lib such as Beautiful Soup for this
    total += len(re.findall(r"<img[^>]*>", html)) 
    return total


def read_url(url, queue):
    # Tag each result with its URL so it can be matched back to its movie,
    # since results arrive in completion order rather than request order.
    data = urllib2.urlopen(url).read()
    queue.put((url, data))


def fetch_urls():
    result = Queue.Queue()
    threads = [threading.Thread(target=read_url, args = (url,result)) for url in URLS]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return result


if __name__ == "__main__":
    start = time.time()
    movies = get_movies()
    for movie in movies["movies"]:
        url = "http://www.imdb.com/title/tt" + movie['alternate_ids']['imdb']
        URLS.append(url)    
    queue = fetch_urls()
    # Drain the queue into a URL -> HTML map; the threads finish in arbitrary
    # order, so results cannot be paired with movies by position alone.
    pages = dict()
    while not queue.empty():
        url, html = queue.get()
        pages[url] = html
    for movie in movies["movies"]:
        url = "http://www.imdb.com/title/tt" + movie['alternate_ids']['imdb']
        json_dict = {
                "title": movie['title'],
                "url": url,
                "imdb_id": movie['alternate_ids']['imdb'],
                "count": get_imgs(pages[url])
                }
        JSON_LIST.append(json_dict)
    pprint.pprint(JSON_LIST)
    end = time.time()
    print "\n"
    print "Elapsed Time (seconds):", end - start