I need to import a zipped csv into a mongo collection, but there is a catch - every record contains a timestamp in Pacific Time, which must be converted to the local time corresponding to the (longitude, latitude) pair found in the same record.
The code is as follows:
def read_csv_zip(path, timezones):
    with ZipFile(path) as z, z.open(z.namelist()[0]) as input:
        csv_rows = csv.reader(input)
        header = csv_rows.next()
        check, converters = get_aux_stuff(header)
        for csv_row in csv_rows:
            if check(csv_row):
                row = {
                    converter[0]: converter[1](value)
                    for converter, value in zip(converters, csv_row)
                    if allow_field(converter)
                }
                ts = row['ts']
                lng, lat = row['loc']
                found_tz_entry = timezones.find_one(SON({'loc': {'$within': {'$box': [[lng - tz_lookup_radius, lat - tz_lookup_radius], [lng + tz_lookup_radius, lat + tz_lookup_radius]]}}}))
                if found_tz_entry:
                    tz_name = found_tz_entry['tz']
                    local_ts = ts.astimezone(timezone(tz_name)).replace(tzinfo=None)
                    row['tz'] = tz_name
                else:
                    local_ts = (ts.astimezone(utc) + timedelta(hours=int(lng / 15))).replace(tzinfo=None)
                row['local_ts'] = local_ts
                yield row

def insert_documents(collection, source, batch_size):
    while True:
        items = list(itertools.islice(source, batch_size))
        if len(items) == 0:
            break
        try:
            collection.insert(items)
        except:
            for item in items:
                try:
                    collection.insert(item)
                except Exception as exc:
                    print("Failed to insert record {0} - {1}".format(item['_id'], exc))

def main(zip_path):
    with Connection() as connection:
        data = connection.mydb.data
        timezones = connection.timezones.data
        insert_documents(data, read_csv_zip(zip_path, timezones), 1000)
Naturally, the timezones collection is properly indexed - a call to explain() confirms it.
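For reference, the index on loc is presumably a 2d geospatial index; a minimal sketch of creating one with pymongo (the actual index definition is not shown in the question, so the exact call is an assumption) could look like this:

import pymongo
from pymongo import Connection

with Connection() as connection:
    timezones = connection.timezones.data
    # 2d geospatial index on the [lng, lat] pairs queried with $within/$box
    timezones.ensure_index([('loc', pymongo.GEO2D)])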
The process is slow. Obviously, having to query the timezones collection for every record kills the performance. I am looking for suggestions on how to improve it.
Thank you.
EDIT
The timezones collection contains 8,176,040 records, each containing four values:
> db.data.findOne()
{ "_id" : 3038814, "loc" : [ 1.48333, 42.5 ], "tz" : "Europe/Andorra" }
EDIT2
OK, I have compiled a release build of http://toblerity.github.com/rtree/ and configured the rtree package. Then I created an rtree dat/idx file pair corresponding to my timezones collection, so instead of calling collection.find_one I now call index.intersection. Surprisingly, not only is there no improvement - it actually works slower now! It may be that rtree could be fine-tuned to load the entire dat/idx pair (704M) into RAM, but I do not know how to do that. Until then, it is not an alternative.
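For context, a rough sketch of the kind of rtree usage described above (the original build script is not shown; the file name, the obj payload and the point-as-box insertion are assumptions):

from rtree import index

# one-off build step: a file-backed index, creating tz.dat / tz.idx
tz_index = index.Index('tz')
for i, doc in enumerate(timezones.find()):
    lng, lat = doc['loc']
    tz_index.insert(i, (lng, lat, lng, lat), obj=doc['tz'])

# per-record lookup, equivalent to the $within/$box query above
hits = list(tz_index.intersection(
    (lng - tz_lookup_radius, lat - tz_lookup_radius,
     lng + tz_lookup_radius, lat + tz_lookup_radius),
    objects=True))
tz_name = hits[0].object if hits else None

Constructing index.Index() without a file name yields a purely in-memory index, which may be one way to keep the whole structure in RAM, at the cost of rebuilding it on every run.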
Overall, I am thinking the solution should involve parallelizing the task.
EDIT3
Profiler output when using collection.find_one:
>>> p.sort_stats('cumulative').print_stats(10)
Tue Apr 10 14:28:39 2012 ImportDataIntoMongo.profile
64549590 function calls (64549180 primitive calls) in 1231.257 seconds
Ordered by: cumulative time
List reduced from 730 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.012 0.012 1231.257 1231.257 ImportDataIntoMongo.py:1(<module>)
1 0.001 0.001 1230.959 1230.959 ImportDataIntoMongo.py:187(main)
1 853.558 853.558 853.558 853.558 {raw_input}
1 0.598 0.598 370.510 370.510 ImportDataIntoMongo.py:165(insert_documents)
343407 9.965 0.000 359.034 0.001 ImportDataIntoMongo.py:137(read_csv_zip)
343408 2.927 0.000 287.035 0.001 c:\python27\lib\site-packages\pymongo\collection.py:489(find_one)
343408 1.842 0.000 274.803 0.001 c:\python27\lib\site-packages\pymongo\cursor.py:699(next)
343408 2.542 0.000 271.212 0.001 c:\python27\lib\site-packages\pymongo\cursor.py:644(_refresh)
343408 4.512 0.000 253.673 0.001 c:\python27\lib\site-packages\pymongo\cursor.py:605(__send_message)
343408 0.971 0.000 242.078 0.001 c:\python27\lib\site-packages\pymongo\connection.py:871(_send_message_with_response)
Profiler output when using index.intersection:
>>> p.sort_stats('cumulative').print_stats(10)
Wed Apr 11 16:21:31 2012 ImportDataIntoMongo.profile
41542960 function calls (41542536 primitive calls) in 2889.164 seconds
Ordered by: cumulative time
List reduced from 778 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.028 0.028 2889.164 2889.164 ImportDataIntoMongo.py:1(<module>)
1 0.017 0.017 2888.679 2888.679 ImportDataIntoMongo.py:202(main)
1 2365.526 2365.526 2365.526 2365.526 {raw_input}
1 0.766 0.766 502.817 502.817 ImportDataIntoMongo.py:180(insert_documents)
343407 9.147 0.000 491.433 0.001 ImportDataIntoMongo.py:152(read_csv_zip)
343406 0.571 0.000 391.394 0.001 c:\python27\lib\site-packages\rtree-0.7.0-py2.7.egg\rtree\index.py:384(intersection)
343406 379.957 0.001 390.824 0.001 c:\python27\lib\site-packages\rtree-0.7.0-py2.7.egg\rtree\index.py:435(_intersection_obj)
686513 22.616 0.000 38.705 0.000 c:\python27\lib\site-packages\rtree-0.7.0-py2.7.egg\rtree\index.py:451(_get_objects)
343406 6.134 0.000 33.326 0.000 ImportDataIntoMongo.py:162(<dictcomp>)
346 0.396 0.001 30.665 0.089 c:\python27\lib\site-packages\pymongo\collection.py:240(insert)
EDIT4
I have parallelized the code, but the results are still not very encouraging. I am convinced it can be done better. See my own answer to this question for details.
Answer 0 (score: 0)
OK, I have parallelized the code, but it runs only twice as fast. Here is my solution:
write_batch_size = 100
read_batch_size = 100
count_parsed_csv_consumers = 15
count_data_records_consumers = 1

parsed_csv_queue = Queue()
data_record_queue = Queue()

def get_parsed_csv_consumer(converters, timezones):
    def do_work(csv_row):
        row = {
            converter[0]: converter[1](value)
            for converter, value in zip(converters, csv_row)
            if allow_field(converter)
        }
        ts = row['ts']
        lng, lat = row['loc']
        found_tz_entry = timezones.find_one(SON({'loc': {'$within': {'$box': [[lng - tz_lookup_radius, lat - tz_lookup_radius], [lng + tz_lookup_radius, lat + tz_lookup_radius]]}}}))
        if found_tz_entry:
            tz_name = found_tz_entry['tz']
            local_ts = ts.astimezone(timezone(tz_name)).replace(tzinfo=None)
            row['tz'] = tz_name
        else:
            local_ts = (ts.astimezone(utc) + timedelta(hours=int(lng / 15))).replace(tzinfo=None)
        row['local_ts'] = local_ts
        return row

    def worker():
        while True:
            csv_rows = parsed_csv_queue.get()
            try:
                rows = []
                for csv_row in csv_rows:
                    rows.append(do_work(csv_row))
                data_record_queue.put_nowait(rows)
            except Exception as exc:
                print(exc)
            parsed_csv_queue.task_done()

    return worker

def get_data_record_consumer(collection):
    items = []

    def do_work(row):
        items.append(row)
        if len(items) == write_batch_size:
            persist_items()

    def persist_items():
        try:
            collection.insert(items)
        except:
            for item in items:
                try:
                    collection.insert(item)
                except Exception as exc:
                    print("Failed to insert record {0} - {1}".format(item['_id'], exc))
        del items[:]

    def data_record_consumer():
        collection  # explicit capture
        while True:
            rows = data_record_queue.get()
            try:
                if rows:
                    for row in rows:
                        do_work(row)
                elif items:
                    persist_items()
            except Exception as exc:
                print(exc)
            data_record_queue.task_done()

    return data_record_consumer

def import_csv_zip_to_collection(path, timezones, collection):
    def get_threads(count, target, name):
        acc = []
        for i in range(count):
            x = Thread(target=target, name=name + " " + str(i))
            x.daemon = True
            x.start()
            acc.append(x)
        return acc

    with ZipFile(path) as z, z.open(z.namelist()[0]) as input:
        csv_rows = csv.reader(input)
        header = next(csv_rows)
        check, converters = get_aux_stuff(header)
        parsed_csv_consumer_threads = get_threads(count_parsed_csv_consumers, get_parsed_csv_consumer(converters, timezones), "parsed csv consumer")
        data_record_consumer_threads = get_threads(count_data_records_consumers, get_data_record_consumer(collection), "data record consumer")
        read_batch = []
        for csv_row in csv_rows:
            if check(csv_row):
                read_batch.append(csv_row)
                if len(read_batch) == read_batch_size:
                    parsed_csv_queue.put_nowait(read_batch)
                    read_batch = []
        if len(read_batch) > 0:
            parsed_csv_queue.put_nowait(read_batch)
            read_batch = []
        parsed_csv_queue.join()
        data_record_queue.join()
        # data record consumers may have some items cached. All of them must flush their caches now.
        # we do it by enqueuing a special item, which when fetched causes the respective consumer to
        # terminate its operation
        for i in range(len(data_record_consumer_threads)):
            data_record_queue.put_nowait(None)
        data_record_queue.join()
The process works like this:
1. CSV rows are read in batches (the size is determined by read_batch_size) and placed into parsed_csv_queue, to be consumed by the parsed_csv_consumer_threads.
2. These consumers do the heavy lifting - looking up the timezone with timezones.find_one. That is why there are many of them; their count is given by count_parsed_csv_consumers (15 here).
3. The parsed rows are batched again (read_batch_size) and, once a batch is full, placed into another queue - data_record_queue.
4. The data record consumers fetch batches of data records from data_record_queue and insert them into the target mongo collection. Their number can be changed via the count_data_records_consumers constant.
In the first version I placed individual records into the queues, but profiling showed that Queue.put_nowait is very expensive, so I was forced to reduce the number of puts by batching the records.
Anyway, the performance improved two-fold, but I was hoping for something better. Here are the profiling results:
>>> p.sort_stats('cumulative').print_stats(10)
Fri Apr 13 13:31:17 2012 ImportOoklaIntoMongo.profile
3782711 function calls (3782429 primitive calls) in 310.209 seconds
Ordered by: cumulative time
List reduced from 737 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.016 0.016 310.209 310.209 .\ImportOoklaIntoMongo.py:1(<module>)
1 0.004 0.004 309.833 309.833 .\ImportOoklaIntoMongo.py:272(main)
1 17.829 17.829 220.432 220.432 .\ImportOoklaIntoMongo.py:225(import_csv_zip_to_collection)
386081 28.049 0.000 135.297 0.000 c:\python27\lib\zipfile.py:508(readline)
107008 7.588 0.000 102.938 0.001 c:\python27\lib\zipfile.py:570(read)
107008 50.716 0.000 95.302 0.001 c:\python27\lib\zipfile.py:598(read1)
71240 3.820 0.000 95.292 0.001 c:\python27\lib\zipfile.py:558(peek)
1 89.382 89.382 89.382 89.382 {raw_input}
386079 43.564 0.000 54.706 0.000 .\ImportOoklaIntoMongo.py:103(check)
35767 40.286 0.001 40.286 0.001 {built-in method decompress}
I am a bit suspicious of the profiler output though, because it seems to show results for the main thread only. Indeed - How can I profile a multithread program in Python?
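One common workaround (a sketch, not part of the original code; the wrapper name and output file pattern are made up) is to run a separate cProfile.Profile per worker thread and dump its stats to a per-thread file:

import cProfile
import threading

def profiled(target):
    # wraps a thread target; writes a per-thread profile once the target returns
    def wrapper(*args, **kwargs):
        profiler = cProfile.Profile()
        try:
            return profiler.runcall(target, *args, **kwargs)
        finally:
            profiler.dump_stats('profile-{0}.out'.format(threading.current_thread().name))
    return wrapper

# usage: Thread(target=profiled(worker), name="parsed csv consumer 0")

Since the worker loops above never return (they run as daemon threads), they would also need an explicit exit condition - for example breaking out of the loop on a sentinel value - for the stats to actually be written.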