Efficiently checking millions of image URLs with Python

Asked: 2017-02-10 02:52:42

Tags: python python-2.7 for-loop optimization

I have a tsv file with over 3 million item rows. Each item has an id, a group and a url, and the group column is sorted.

I load it in a Python script and need to check that the urls of all items in a group return status 200 OK before loading those items into a database. I thought about using processes and doing the URL check in each of them (I don't have much experience with this, so I'm not sure whether it's even a good idea).

My logic atm: fill array a1 with the items of gr1 → pass each item in a1 to a new process → that process checks for 200 → if OK, put the item into array a2 → when all items in a1 are checked, push a2 to the DB (along with other things) → repeat for the next group.
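
A minimal sketch of what I mean, simplified to a multiprocessing.Pool (check_url, push_group_to_db and iter_groups are hypothetical stand-ins for my real code):

    import requests
    from multiprocessing import Pool

    def check_url(item):
        # return the item if its url answers 200 OK, else None
        try:
            if requests.get(item["url"], timeout=10).status_code == 200:
                return item
        except requests.RequestException:
            pass
        return None

    def process_group(a1, pool):
        # a1 holds all items of one group; a2 collects the items that passed
        a2 = [item for item in pool.map(check_url, a1) if item is not None]
        push_group_to_db(a2)  # hypothetical stand-in for the real DB insert

    if __name__ == '__main__':
        pool = Pool(processes=8)
        for a1 in iter_groups('items.tsv'):  # hypothetical: yields per-group item lists
            process_group(a1, pool)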

For 100,000 items this takes 30 minutes. The bottleneck is the URL check; without it, by comparison, the script is lightning fast. A sample of the tsv so far:

x1    gr1    {some url}/x1.jpg
x2    gr1    {some url}/x2.jpg
x3    gr2    {some url}/x1.jpg  

Other considerations: splitting the original tsv into 30 separate tsv files and running the batch script 30 times in parallel. Would that make a difference?
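
If I do split it, one thing to keep in mind is that the group column is sorted, so a naive line-based split could cut a group across two files. A sketch of a split that only breaks on group boundaries (file naming and part count are just placeholders):

    import csv
    from itertools import groupby

    def split_tsv_by_group(path, parts=30):
        # Split a group-sorted tsv into roughly equal parts without
        # ever separating the rows of one group.
        with open(path) as f:
            rows = list(csv.reader(f, delimiter='\t'))
        per_part = len(rows) // parts + 1
        chunk, idx = [], 0
        for _, g in groupby(rows, key=lambda r: r[1]):  # r[1] is the group column
            chunk.extend(g)
            if len(chunk) >= per_part:
                write_part(path, idx, chunk)
                chunk, idx = [], idx + 1
        if chunk:
            write_part(path, idx, chunk)

    def write_part(path, idx, rows):
        with open('%s.part%02d' % (path, idx), 'w') as out:
            csv.writer(out, delimiter='\t', lineterminator='\n').writerows(rows)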

2 Answers:

Answer 0 (score: 3)

  1. Since you don't need the actual image, using HEAD requests should improve speed. If the response is neither 200 nor 404, HEAD may not be allowed (405) and you simply try again with a GET request.
  2. You currently wait for the whole current group to finish before starting any new tasks. In general it is better to keep the number of in-flight requests roughly constant at all times. You will also want to increase the worker pool considerably, since the task is almost entirely I/O-bound; that said, I would recommend doing something along the lines of point 3 (i.e. asynchronous I/O).
  3. If you are willing to use Python 3, you can use asyncio (https://docs.python.org/3/library/asyncio.html) together with aiohttp (https://pypi.python.org/pypi/aiohttp) to take full advantage of asynchronous I/O:
    import asyncio
    from aiohttp import ClientSession, Timeout  # Timeout is the aiohttp 1.x API
    import csv
    from threading import Thread
    from queue import Queue
    from time import sleep
    
    async def check(url, session):
        try:
            with Timeout(10):
                async with session.head(url) as response:
                    if response.status == 200:
                        return True
                    elif response.status == 404:
                        return False
                    else:
                        async with session.get(url) as response:
                            return (response.status == 200)
        except Exception:  # any error (timeout, DNS, connection) counts as failed
            return False
    
    
    
    def worker(q):
        while True:
            f = q.get()
            try:
                f()
            except Exception as e:
                print(e)
            q.task_done()
    
    q = Queue()
    for i in range(4):
        t = Thread(target=worker, args=(q,))
        t.daemon = True
        t.start()
    
    def item_ok(url):
        #Do something
        sleep(0.5)
        pass
    
    def item_failed(url):
        #Do something
        sleep(0.5)
        pass
    
    def group_done(name,g):
        print("group %s with %d items done (%d failed)\n" %
              (name,g['total'],g['fail']))
    
    async def bound_check(sem, item, session, groups):
        async with sem:
            g = groups[item["group"]]
            if (await check(item["item_url"], session)):
                g["success"] += 1
                q.put(lambda: item_ok(item["item_url"]))
            else:
                g["fail"] += 1
                q.put(lambda: item_failed(item["item_url"]))
            if g["success"] + g["fail"] == g['total']:
                q.put(lambda: group_done(item['group'],g))
            bound_check.processed += 1
            if bound_check.processed % 100 == 0:
                print ("Milestone: %d\n" % bound_check.processed)
    
    bound_check.processed = 0
    
    groups = {}
    
    async def run(max_pending=1000):
        #Choose such that you do not run out of FDs
        sem = asyncio.Semaphore(max_pending)
    
        f = open('./test.tsv', 'r', encoding='utf8')
        reader = csv.reader(f, delimiter='\t')  # one item per row, columns split on tabs

        tasks = []

        async with ClientSession() as session:
            for x in reader:
                item = {"id": x[0], "group": x[1], "item_url": x[2]}
                if not item["group"] in groups:
                    groups[item["group"]] = {'total'    : 1,
                                             'success'  : 0,
                                             'fail'     : 0,
                                             'items'    : [item]}
                else:
                    groups[item["group"]]['total'] += 1
                    groups[item["group"]]['items'].append(item)
                task = asyncio.ensure_future(bound_check(sem, item, session, groups))
                tasks.append(task)
    
            await asyncio.gather(*tasks)
        f.close()
    
    loop = asyncio.get_event_loop()
    loop.run_until_complete(run())
    q.join()
    
    print("Done")
    

Answer 1 (score: 2)

As was already mentioned, you should try using HEAD instead of GET requests. That will avoid downloading the images. Moreover, you seem to spawn a separate process per request, which is also inefficient.

I don't think asyncio is really needed here, performance-wise. A solution using a plain thread pool (not even a process pool) is a bit simpler, IMHO :) Besides, it's available in Python 2.7.

import requests
from concurrent.futures import ThreadPoolExecutor, as_completed
import csv
from collections import defaultdict

def read_rows(file):
    with open(file) as f_in:
        return [row for row in csv.reader(f_in, delimiter='\t')]

def check_url(inp):
    """Gets called by workers in thread pool. Checks for existence of URL."""
    id, grp, url = inp
    def chk():
        try:
            return requests.head(url).status_code == 200
        except IOError:  # requests exceptions subclass IOError
            return False
    return (id, grp, url, chk())

if __name__ == '__main__':
    d = defaultdict(list)
    with ThreadPoolExecutor(max_workers=20) as executor:
        future_to_input = {executor.submit(check_url, inp): inp for inp in read_rows('urls.txt')}
        for future in as_completed(future_to_input):
            id, grp, url, res = future.result()
            d[grp].append((id, url, res))
    # do something with your d (e.g. sort appropriately, filter those with len(d[grp]) <= 1, ...)
    for g, bs in d.items():
        print(g)
        for id, url, res in bs:
            print("  %s %5s %s" % (id, res, url))

As you can see, I process each row of the CSV input individually and group the results (using d) rather than the input. Mostly a matter of taste, I guess. You may want to play with max_workers=20 and perhaps increase it.
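
One refinement you could borrow from the other answer: some servers reject HEAD with 405 Method Not Allowed, so a HEAD-only check can report false negatives. A sketch of a drop-in variant of check_url (the timeout values are assumptions) that falls back to a streamed GET, so the image body still isn't downloaded:

    def check_url(inp):
        """Check URL existence via HEAD, falling back to GET when HEAD is rejected."""
        id, grp, url = inp
        try:
            status = requests.head(url, timeout=10).status_code
            if status not in (200, 404):
                # HEAD may simply be disallowed; retry with GET, but stream
                # the response so the image body is not actually fetched.
                r = requests.get(url, timeout=10, stream=True)
                status = r.status_code
                r.close()
            return (id, grp, url, status == 200)
        except IOError:
            return (id, grp, url, False)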