I am building an aggregation platform. We want to store resized versions of images 'aggregated' from the web on our servers. Specifically, these are e-commerce product images from different vendors. Each 'item' dict has an "image" field holding a URL that needs to be downloaded, compressed, and saved to disk.
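For context, one line of the JSON file parsed below might look like this (only the image, vendor, and name fields are actually referenced in the code; the values here are made-up illustrations):

{"name": "Blue T-Shirt", "vendor": "acme", "image": "http://example.com/img/123.jpg"}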
Download-and-compress method:
import hashlib
import cStringIO
import urllib2
from PIL import Image

def downloadCompressImage(url, width, item):
    # Retrieve our source image from a URL
    # Load the URL data into an image
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    response = opener.open(url)
    img = cStringIO.StringIO(response.read())
    im = Image.open(img)
    # Compute the height that preserves the aspect ratio at the target width
    wpercent = width / float(im.size[0])
    hsize = int(float(im.size[1]) * wpercent)
    # Resize the image
    im2 = im.resize((width, hsize), Image.ANTIALIAS)
    key_name = (item["vendor"] + "_" + hashlib.md5(url.encode('utf-8')).hexdigest()
                + "_" + str(width) + "x" + str(hsize) + ".jpg")
    # `timestamp` is assumed to be defined at module level (it is not shown in the excerpt)
    path = "/var/www/html/server/images/" + timestamp + "/"
    # Save the compressed image to disk
    im2.save(path + key_name, 'JPEG', quality=85)
    url = "http://server.com/images/" + timestamp + "/" + key_name
    return url
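One robustness note on this method: as written, a single unresponsive server can stall a pool worker indefinitely. A minimal variant, assuming Python 2.6+ where the opener's open() accepts a timeout argument, caps the wait per request (the 10-second value is an assumption to tune against your network):

    # Sketch: cap the per-request wait so one dead URL cannot hang a worker
    response = opener.open(url, timeout=10)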
Worker method:
import json
import sys
import traceback

def worker(lines):
    """Make a dict out of the parsed, supplied lines"""
    result = []
    for line in lines:
        line = line.rstrip('\n')
        item = json.loads(line.decode('ascii', 'ignore'))
        #
        # Do stuff with the item dict and update it
        #
        # Append item to result if image download and compression succeed
        try:
            item["grid_image"] = downloadCompressImage(item["image"], 200, item)
        except Exception:
            print "dl-comp exception in processing: " + item['name'] + " " + item['vendor']
            traceback.print_exc(file=sys.stdout)
            continue
        if item["grid_image"] != -1:
            result.append(item)
    return result
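Because the per-item work is dominated by network I/O rather than CPU, concurrency can usefully exceed the core count. Below is a sketch of a thread-based variant of worker using multiprocessing.dummy, a thread-backed drop-in for multiprocessing.Pool available in Python 2.6+; the pool size of 100 and the helper name are assumptions to tune:

from multiprocessing.dummy import Pool as ThreadPool

def worker_threaded(lines):
    """Same contract as worker(), but downloads items concurrently with threads."""
    items = [json.loads(l.rstrip('\n').decode('ascii', 'ignore')) for l in lines]

    def process(item):
        # Returns the updated item, or None if download/compression failed
        try:
            item["grid_image"] = downloadCompressImage(item["image"], 200, item)
            return item
        except Exception:
            traceback.print_exc(file=sys.stdout)
            return None

    tpool = ThreadPool(100)  # assumed size; I/O-bound work tolerates many threads
    processed = tpool.map(process, items)
    tpool.close()
    tpool.join()
    return [it for it in processed if it is not None and it["grid_image"] != -1]

Combining this with the existing process pool (processes for the CPU-bound resize, threads for the downloads) is a common pattern for mixed workloads like this one.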
Main method:
import multiprocessing
import os

if __name__ == '__main__':
    # Configurable options; different values may work better.
    numthreads = 15
    numlines = 1000
    lines = open('parserProducts.json').readlines()
    # Create the process pool
    pool = multiprocessing.Pool(processes=numthreads)
    chunks = (lines[i:i + numlines] for i in xrange(0, len(lines), numlines))
    for result_lines in pool.imap(worker, chunks):
        for line in result_lines:
            jdata = json.dumps(line)
            f.write(jdata + ',\n')
    pool.close()
    pool.join()
    # Replace the final ",\n" with the closing bracket of the JSON array
    f.seek(-2, os.SEEK_END)
    f.truncate()
    f.write(']')
    print "parsing is done"
My question: is this the best I can do with Python? The item dict count is ~3 M. Without the call to downloadCompressImage, the "#Do stuff with the item dict and update it" part finishes in just 8 minutes. With the compression, though, it looks like it would take weeks, maybe even months.
Any ideas appreciated, thanks a bunch.
Answer 0 (score: 0)
You are working with 3 million images that are downloaded from the internet and then compressed, so how long this takes depends, as far as I can tell, on two things. It is not Python limiting you, and you did well to use multiprocessing.Pool; the main bottlenecks are your network speed and the number of cores (or CPU power) you have.
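A quick way to confirm which of the two dominates is to time the download and the resize separately on a small sample. A minimal sketch (the 200-pixel width mirrors the question's code; the timeout and any sample URLs you feed it are assumptions):

import time
import urllib2
import cStringIO
from PIL import Image

def time_one(url, width=200):
    # Time the network fetch and the decode+resize phases separately
    t0 = time.time()
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    data = opener.open(url, timeout=10).read()
    t1 = time.time()
    im = Image.open(cStringIO.StringIO(data))
    hsize = int(float(im.size[1]) * width / im.size[0])
    im.resize((width, hsize), Image.ANTIALIAS)
    t2 = time.time()
    print "download: %.3fs  resize: %.3fs" % (t1 - t0, t2 - t1)

If the download phase dominates, adding more threads per worker will help; if the resize dominates, only more cores (or a cheaper resize filter) will.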