I have a TSV file with more than 3 million item rows. Each item has an id, a group, and a url, and the group column is sorted. For example:

x1 gr1 {some url}/x1.jpg
x2 gr1 {some url}/x2.jpg
x3 gr2 {some url}/x1.jpg

I load it into a Python script and need to check that the urls of all items in a group return status 200 OK before loading those items into the database. I thought about using processes and doing the URL check in each of them (I don't have much experience with this, so I'm not sure it's even a good idea).

My logic at the moment: fill array a1 with the items of gr1 -> pass each item in a1 to a new process -> that process checks for 200 -> if OK, put the item into array a2 -> when all items in a1 have been checked, push a2 to the DB (along with other stuff) -> repeat for the next group.

It takes 30 minutes for 100,000 items. The bottleneck is the URL check; without it, the script is lightning fast by comparison.

Another consideration: splitting the original TSV into 30 separate TSV files and running the batch script 30 times in parallel. Would that make a difference?
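Condensed into code, the per-group / one-process-per-item logic described above amounts to roughly the following sketch. It is an illustration only, not the actual script; push_to_db and the file name items.tsv are hypothetical placeholders.

import csv
from itertools import groupby
from multiprocessing import Process, Queue
import requests

def check_item(item, out):
    # One process per item, as described above: report the item back
    # only flagged as OK when its URL answers 200.
    _id, _group, url = item
    try:
        ok = requests.head(url, timeout=10).status_code == 200
    except requests.RequestException:
        ok = False
    out.put(item if ok else None)

def push_to_db(rows):
    # Hypothetical stand-in for the DB insert step.
    print("pushing %d rows" % len(rows))

if __name__ == '__main__':
    with open('items.tsv', encoding='utf8') as f:
        rows = list(csv.reader(f, delimiter='\t'))
    # The group column is sorted, so groupby yields one batch (a1) per group.
    for _, batch in groupby(rows, key=lambda r: r[1]):
        a1 = list(batch)
        out = Queue()
        procs = [Process(target=check_item, args=(item, out)) for item in a1]
        for p in procs:
            p.start()
        # Every worker puts exactly one result, so collect len(procs) of them.
        a2 = [r for r in (out.get() for _ in procs) if r is not None]
        for p in procs:
            p.join()
        push_to_db(a2)  # push the checked items (a2) to the DB, then repeat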
Answer 0 (score: 3)
import asyncio
from aiohttp import ClientSession, Timeout
import csv
import re
from threading import Thread
from queue import Queue
from time import sleep

async def check(url, session):
    # Try a cheap HEAD first; only fall back to GET when the server answers
    # with something other than 200/404. Any error counts as a failed check.
    try:
        with Timeout(10):
            async with session.head(url) as response:
                if response.status == 200:
                    return True
                elif response.status == 404:
                    return False
                else:
                    async with session.get(url) as response:
                        return (response.status == 200)
    except:
        return False

def worker(q):
    # Background threads drain the callback queue so the slow
    # "do something with the result" work never blocks the event loop.
    while True:
        f = q.get()
        try:
            f()
        except Exception as e:
            print(e)
        q.task_done()

q = Queue()
for i in range(4):
    t = Thread(target=worker, args=(q,))
    t.daemon = True
    t.start()

def item_ok(url):
    # Do something (e.g. stage the item for the DB insert)
    sleep(0.5)
    pass

def item_failed(url):
    # Do something
    sleep(0.5)
    pass

def group_done(name, g):
    print("group %s with %d items done (%d failed)\n" %
          (name, g['total'], g['fail']))

async def bound_check(sem, item, session, groups):
    # The semaphore caps how many URL checks are in flight at once.
    async with sem:
        g = groups[item["group"]]
        if await check(item["item_url"], session):
            g["success"] += 1
            q.put(lambda: item_ok(item["item_url"]))
        else:
            g["fail"] += 1
            q.put(lambda: item_failed(item["item_url"]))
        # Once every item of the group has been checked, hand the whole
        # group to the worker threads (this is where the DB push belongs).
        if g["success"] + g["fail"] == g['total']:
            q.put(lambda: group_done(item['group'], g))
        bound_check.processed += 1
        if bound_check.processed % 100 == 0:
            print("Milestone: %d\n" % bound_check.processed)

bound_check.processed = 0
groups = {}

async def run(max_pending=1000):
    # Choose max_pending such that you do not run out of file descriptors.
    sem = asyncio.Semaphore(max_pending)
    f = open('./test.tsv', 'r', encoding='utf8')
    # Each row comes back as a one-element list; split on tabs ourselves.
    reader = csv.reader(f, delimiter='\n')
    tasks = []
    async with ClientSession() as session:
        for _, utf8_row in enumerate(reader):
            unicode_row = utf8_row[0]
            x = re.split(r'\t', unicode_row)
            item = {"id": x[0], "group": x[1], "item_url": x[2]}
            if not item["group"] in groups:
                groups[item["group"]] = {'total': 1,
                                         'success': 0,
                                         'fail': 0,
                                         'items': [item]}
            else:
                groups[item["group"]]['total'] += 1
                groups[item["group"]]['items'].append(item)
            task = asyncio.ensure_future(bound_check(sem, item, session, groups))
            tasks.append(task)
        responses = asyncio.gather(*tasks)
        await responses

loop = asyncio.get_event_loop()
loop.run_until_complete(run())
q.join()

print("Done")
Answer 1 (score: 2)
As already mentioned, you should try using HEAD instead of GET. That avoids downloading the images. Moreover, you seem to spawn a separate process per request, which is also inefficient.
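For illustration (this snippet is not from the answer, and the URL is a placeholder): a HEAD request transfers only the status line and headers, and a streamed GET can act as a fallback for servers that reject HEAD, still without pulling down the image body.

import requests

url = "http://example.com/x1.jpg"  # placeholder

# HEAD: status and headers only, the image body is never transferred.
ok = requests.head(url, timeout=10).status_code == 200

# Some servers answer HEAD with 405/403; stream=True defers the body
# download, so checking the status and closing still avoids the image.
if not ok:
    with requests.get(url, stream=True, timeout=10) as r:
        ok = r.status_code == 200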
I don't think asyncio is really needed here, performance-wise. A solution using a plain thread pool (not even a process pool) is a bit simpler to grasp, IMHO :) Plus, it's available in Python 2.7.
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed
import csv
from collections import defaultdict

def read_rows(file):
    with open(file) as f_in:
        return [row for row in csv.reader(f_in, delimiter='\t')]

def check_url(inp):
    """Gets called by workers in thread pool. Checks for existence of URL."""
    id, grp, url = inp
    def chk():
        try:
            return requests.head(url).status_code == 200
        except IOError as e:
            return False
    return (id, grp, url, chk())

if __name__ == '__main__':
    d = defaultdict(lambda: [])
    with ThreadPoolExecutor(max_workers=20) as executor:
        future_to_input = {executor.submit(check_url, inp): inp
                           for inp in read_rows('urls.txt')}
        for future in as_completed(future_to_input):
            id, grp, url, res = future.result()
            d[grp].append((id, url, res))

    # do something with your d (e.g. sort appropriately,
    # filter those with len(d[grp]) <= 1, ...)
    for g, bs in d.items():
        print(g)
        for id, url, res in bs:
            print("  %s %5s %s" % (id, res, url))
As you can see, I process each line of the CSV input individually and do the grouping on the results (using d), not on the input. Mostly a matter of taste, I guess. You may also want to play around with max_workers=20 and possibly increase it.
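To tie this back to the original requirement (only load a group when every URL in it came back 200), the post-processing of d could look something like the following sketch, where save_group is a hypothetical stand-in for the DB insert:

def save_group(grp, items):
    # hypothetical stand-in for the actual DB insert
    print("inserting group %s (%d items)" % (grp, len(items)))

for grp, bs in d.items():
    if all(res for _id, _url, res in bs):
        save_group(grp, bs)
    else:
        print("skipping group %s: at least one URL failed the check" % grp)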