I am building an aggregation platform. We want to store resized versions of images 'aggregated' from the web on our servers. Specifically, these are e-commerce product images from different vendors. Each 'item' dict has an "image" field holding a URL that needs to be downloaded, compressed, and saved to disk.
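For context, one line of the JSON file parsed below might look like this (only the image, vendor, and name fields are actually referenced in the code; the values here are made-up illustrations):

{"name": "Blue T-Shirt", "vendor": "acme", "image": "http://example.com/img/123.jpg"}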
Download-and-compress method:
import hashlib
import cStringIO
import urllib2
from PIL import Image

def downloadCompressImage(url, width, item):
    # Retrieve our source image from a URL
    # Load the URL data into an image
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    response = opener.open(url)
    img = cStringIO.StringIO(response.read())
    im = Image.open(img)
    # Compute the height that preserves the aspect ratio at the target width
    wpercent = width / float(im.size[0])
    hsize = int(float(im.size[1]) * wpercent)
    # Resize the image
    im2 = im.resize((width, hsize), Image.ANTIALIAS)
    key_name = (item["vendor"] + "_" + hashlib.md5(url.encode('utf-8')).hexdigest()
                + "_" + str(width) + "x" + str(hsize) + ".jpg")
    # `timestamp` is assumed to be defined at module level (it is not shown in the excerpt)
    path = "/var/www/html/server/images/" + timestamp + "/"
    # Save the compressed image to disk
    im2.save(path + key_name, 'JPEG', quality=85)
    url = "http://server.com/images/" + timestamp + "/" + key_name
    return url
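One robustness note on this method: as written, a single unresponsive server can stall a pool worker indefinitely. A minimal variant, assuming Python 2.6+ where the opener's open() accepts a timeout argument, caps the wait per request (the 10-second value is an assumption to tune against your network):

    # Sketch: cap the per-request wait so one dead URL cannot hang a worker
    response = opener.open(url, timeout=10)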
Worker method:
import json
import sys
import traceback

def worker(lines):
    """Make a dict out of the parsed, supplied lines"""
    result = []
    for line in lines:
        line = line.rstrip('\n')
        item = json.loads(line.decode('ascii', 'ignore'))
        #
        # Do stuff with the item dict and update it
        #
        # Append item to result if image download and compression succeed
        try:
            item["grid_image"] = downloadCompressImage(item["image"], 200, item)
        except Exception:
            print "dl-comp exception in processing: " + item['name'] + " " + item['vendor']
            traceback.print_exc(file=sys.stdout)
            continue
        if item["grid_image"] != -1:
            result.append(item)
    return result
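Because the per-item work is dominated by network I/O rather than CPU, concurrency can usefully exceed the core count. Below is a sketch of a thread-based variant of worker using multiprocessing.dummy, a thread-backed drop-in for multiprocessing.Pool available in Python 2.6+; the pool size of 100 and the helper name are assumptions to tune:

from multiprocessing.dummy import Pool as ThreadPool

def worker_threaded(lines):
    """Same contract as worker(), but downloads items concurrently with threads."""
    items = [json.loads(l.rstrip('\n').decode('ascii', 'ignore')) for l in lines]

    def process(item):
        # Returns the updated item, or None if download/compression failed
        try:
            item["grid_image"] = downloadCompressImage(item["image"], 200, item)
            return item
        except Exception:
            traceback.print_exc(file=sys.stdout)
            return None

    tpool = ThreadPool(100)  # assumed size; I/O-bound work tolerates many threads
    processed = tpool.map(process, items)
    tpool.close()
    tpool.join()
    return [it for it in processed if it is not None and it["grid_image"] != -1]

Combining this with the existing process pool (processes for the CPU-bound resize, threads for the downloads) is a common pattern for mixed workloads like this one.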
Main method:
import multiprocessing
import os

if __name__ == '__main__':
    # Configurable options; different values may work better.
    numthreads = 15
    numlines = 1000
    lines = open('parserProducts.json').readlines()
    # Create the process pool
    pool = multiprocessing.Pool(processes=numthreads)
    chunks = (lines[i:i + numlines] for i in xrange(0, len(lines), numlines))
    for result_lines in pool.imap(worker, chunks):
        for line in result_lines:
            jdata = json.dumps(line)
            f.write(jdata + ',\n')
    pool.close()
    pool.join()
    # Replace the final ",\n" with the closing bracket of the JSON array
    f.seek(-2, os.SEEK_END)
    f.truncate()
    f.write(']')
    print "parsing is done"
My question: is this the best I can do with Python? The item dict count is ~3 M. Without the call to downloadCompressImage, the "#Do stuff with the item dict and update it" part finishes in just 8 minutes. With the compression, though, it looks like it would take weeks, maybe even months.
Any ideas appreciated, thanks a bunch.
Answer 0 (score: 0)
You are working with 3 million images that are downloaded from the internet and then compressed, so how long this takes depends, as far as I can tell, on two things. It is not Python limiting you, and you did well to use multiprocessing.Pool; the main bottlenecks are your network speed and the number of cores (or CPU power) you have.
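A quick way to confirm which of the two dominates is to time the download and the resize separately on a small sample. A minimal sketch (the 200-pixel width mirrors the question's code; the timeout and any sample URLs you feed it are assumptions):

import time
import urllib2
import cStringIO
from PIL import Image

def time_one(url, width=200):
    # Time the network fetch and the decode+resize phases separately
    t0 = time.time()
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    data = opener.open(url, timeout=10).read()
    t1 = time.time()
    im = Image.open(cStringIO.StringIO(data))
    hsize = int(float(im.size[1]) * width / im.size[0])
    im.resize((width, hsize), Image.ANTIALIAS)
    t2 = time.time()
    print "download: %.3fs  resize: %.3fs" % (t1 - t0, t2 - t1)

If the download phase dominates, adding more threads per worker will help; if the resize dominates, only more cores (or a cheaper resize filter) will.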