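I have a project at home where I use the Rotten Tomatoes API to collect the movies currently playing in theaters. It then collects all the images on each movie's IMDb page. The problem I'm having is with collecting the images. The goal is to have this code run in under 8 seconds, but the regex step takes forever! Currently I'm using this regex: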
re.findall('<img.*?>', str(line))
where line is a chunk of HTML.
Does anyone have a better regex they can think of (maybe a more refined one)? All comments welcome!
The full code is below.
import json, re, pprint, time
from urllib2 import urlopen

def get_image(url):
    total = 0
    page = urlopen(url).readlines()
    for line in page:
        hit = re.findall('<img.*?>', str(line))
        total += len(hit)
    # print('{0} Images total: {1}'.format(url, total))
    return total

if __name__ == "__main__":
    start = time.time()
    json_list = list()
    url = "http://api.rottentomatoes.com/api/public/v1.0/lists/movies/in_theaters.json?apikey=<apikey>"
    response = urlopen(url)
    data = json.loads(response.read())
    for i in data["movies"]:
        json_dict = dict()
        json_dict["Title"] = str(i['title'])
        json_dict["url"] = str("http://www.imdb.com/title/tt" + i['alternate_ids']['imdb'])
        json_dict["imdb_id"] = str(i['alternate_ids']['imdb'])
        json_dict["count"] = get_image(str(json_dict["url"]) )
        json_list.append(json_dict)
    end = time.time()
    pprint.pprint(json_list)
    runtime = end - start
    print "Program runtime: " + str(runtime)
Answer 0 (score: 1)
You can't reliably parse HTML with regular expressions. If you are limited to the Python 2 standard library, use HTMLParser:
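from HTMLParser import HTMLParser

class ImgFinder(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            # attrs is a list of (name, value) pairs; .get() avoids a
            # KeyError when an <img> tag has no src attribute
            print 'found img tag, src=', dict(attrs).get('src')

parser = ImgFinder()
parser.feed(... HTML source ...)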
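To tie this back to the question, which only needs a count per page, here is a minimal sketch of the same idea that collects the src values instead of printing them. The names ImgCounter and count_images and the sample string are made up for illustration, and the import fallback is only there so the sketch also runs on Python 3, where the module is called html.parser.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class ImgCounter(HTMLParser):
    """Hypothetical helper: records the src of every <img> tag it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> also end up here, because the
        # default handle_startendtag calls handle_starttag.
        if tag == 'img':
            # .get() returns None for <img> tags without a src attribute
            self.sources.append(dict(attrs).get('src'))


def count_images(html):
    """Return the number of <img> tags in an HTML string."""
    parser = ImgCounter()
    parser.feed(html)
    parser.close()
    return len(parser.sources)


# Made-up sample input: one plain tag, one self-closing tag without src
sample = '<p>hi</p><img src="a.jpg"><div><img alt="no source"/></div>'
print(count_images(sample))
```

In the question's code, count_images would be called once on the whole page text rather than line by line, so tags that span several lines are still counted.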
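Answer 1 (score: 1)
While you should certainly heed the general wisdom that using a regex to parse HTML is a bad idea (you really should use an HTML parser), there is something worth noting about the efficiency of your regex.
Compare these two: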
>>> timeit('import re; re.findall("<img.*?>", \'blah blah blah <img src="http://www.example.org/test.jpg"> blah blah blah <img src="http://wwww.example.org/test2.jpg"> blah blah blah\')')
3.366645097732544
>>> timeit('import re; re.findall("<img[^>]*>", \'blah blah blah <img src="http://www.example.org/test.jpg"> blah blah blah <img src="http://wwww.example.org/test2.jpg"> blah blah blah\')')
2.328295946121216
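You can see that the latter regex, which is equivalent for tags that sit on a single line, is noticeably faster. This is because it doesn't need to backtrack. See this excellent blog post, http://blog.stevenlevithan.com/archives/greedy-lazy-performance, for an explanation of why.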
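If you want to check this on your own machine, the comparison can be sketched as a small script. It is written for Python 3, the constant names and sample strings are my own, and the timings it prints will differ from the numbers above. It also shows the one case where the two patterns disagree: '.' does not match a newline by default, so only the negated class finds a tag that is split across lines.

```python
import re
from timeit import timeit

# The two patterns under comparison, compiled once
LAZY = re.compile(r'<img.*?>')
NEGATED = re.compile(r'<img[^>]*>')

# Made-up sample inputs
one_line = 'blah <img src="a.jpg"> blah <img src="b.jpg"> blah'
multi_line = 'blah <img\n src="a.jpg"> blah'

# On single-line tags both patterns return the same matches
print(LAZY.findall(one_line) == NEGATED.findall(one_line))

# Only the negated character class matches a tag containing a newline
print(len(LAZY.findall(multi_line)), len(NEGATED.findall(multi_line)))

# Rough timing; absolute numbers depend on the machine and Python version
for name, pattern in (('lazy', LAZY), ('negated', NEGATED)):
    print(name, timeit(lambda: pattern.findall(one_line), number=100000))
```

Since the question's code feeds the regex one line at a time, the multi-line difference does not change its counts, but it would matter if the whole page were searched in one call.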
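Answer 2 (score: 0)
While I know that using a regex to search HTML for img tags is not ideal, this is close to what I ended up going with. With threading I was able to get the runtime down to 2-12 seconds, depending on your connection: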
#No shebang line, please run in Linux shell % python img_count.py
#Python libs
import threading, urllib2, re
import Queue, json, time, pprint

#Global lists
JSON_LIST = list()
URLS = list()

def get_movies():
    url = "http://api.rottentomatoes.com/api/public/v1.0/lists/movies/in_theaters.json?apikey=<apikey>"
    response = urllib2.urlopen(url)
    data = json.loads(response.read())
    return data

def get_imgs(html):
    total = 0
    # This next line is not ideal. Would much rather use a lib such as Beautiful Soup for this
    total += len(re.findall(r"<img[^>]*>", html))
    return total

def read_url(url, queue):
    data = urllib2.urlopen(url).read()
    # Queue the URL together with its HTML so each result can be matched to its movie
    queue.put((url, data))

def fetch_urls():
    result = Queue.Queue()
    threads = [threading.Thread(target=read_url, args = (url,result)) for url in URLS]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return result

if __name__ == "__main__":
    start = time.time()
    movies = get_movies()
    for movie in movies["movies"]:
        url = "http://www.imdb.com/title/tt" + movie['alternate_ids']['imdb']
        URLS.append(url)
    queue = fetch_urls()
    # Threads finish in arbitrary order, so index the downloaded pages by URL
    pages = dict()
    while not queue.empty():
        url, html = queue.get()
        pages[url] = html
    for movie in movies["movies"]:
        url = "http://www.imdb.com/title/tt" + movie['alternate_ids']['imdb']
        json_dict = {
            "title": movie['title'],
            "url": url,
            "imdb_id": movie['alternate_ids']['imdb'],
            "count": get_imgs(pages[url])
        }
        JSON_LIST.append(json_dict)
    pprint.pprint(JSON_LIST)
    end = time.time()
    print "\n"
    print "Elapsed Time (seconds):", end - start
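One thing to watch with any threaded version is that downloads complete in arbitrary order, so each result has to be tied back to the URL it came from. As a rough sketch of an alternative (not what I actually ran), a thread pool from concurrent.futures handles this, because Executor.map returns results in input order. This assumes Python 3; count_all, the fetch parameter, and FAKE_PAGES are names invented for the example, and the fake pages stand in for real downloads so it runs offline.

```python
import re
from concurrent.futures import ThreadPoolExecutor

IMG_TAG = re.compile(r'<img[^>]*>')


def count_imgs(html):
    """Count <img> tags with the same regex as above."""
    return len(IMG_TAG.findall(html))


def count_all(urls, fetch, max_workers=8):
    """Fetch every URL in a worker thread and return {url: image count}.

    `fetch` is any callable that takes a URL and returns the page as text.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields pages in the order of `urls`, no matter which
        # download finishes first
        pages = pool.map(fetch, urls)
        return dict(zip(urls, (count_imgs(page) for page in pages)))


# Stand-in for real downloads (hypothetical data)
FAKE_PAGES = {
    'http://example.org/a': '<img src="1.jpg"><p>text</p><img src="2.jpg">',
    'http://example.org/b': '<p>no images</p>',
}

counts = count_all(list(FAKE_PAGES), FAKE_PAGES.get)
print(counts)
```

For real use, fetch would be a small function that downloads the URL and decodes the response body; max_workers=8 is an arbitrary choice here.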