Question

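I need to integrate over half a TB of data into MongoDB (about 525GB).

The data are visits to my website; each line of the file is a tab-delimited string.

This is my main loop: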

f = codecs.open(directoryLocation+fileWithData, encoding='utf-8')
count = 0 #start the line counter

for line in f:
    print line
    line = line.split('\t')

    document = {
        'date': line[0],     # a date as a string
        'user_id': line[1],  # a string 
        'datetime': line[2], # a unix timestamp
        'url': line[3],      # a fairly long string
        'ref_url': line[4],  # another fairly long string
        'date_obj': datetime.utcfromtimestamp(float(line[2])) #a date object
    }

    Visits.insert(document)

    #line integration time/stats
    count = count + 1 #increment the counter
    now = datetime.now()
    diff = now - startTime
    taken = diff.seconds
    avgPerLine = float(taken) / float(count)
    totalTimeLeft = (howManyLinesTotal-count) * avgPerLine
    print "Time left (mins): " + str(totalTimeLeft/60) #output the stats
    print "Avg per line: " + str(avgPerLine)

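I'm currently at about 0.00095 seconds per line, which is really slow considering the amount of data I need to integrate.

Update: pymongo.has_c() initially returned False, so I enabled the PyMongo C extension. It is now enabled, and the loop hovers at about 0.0007 or 0.0008 seconds per line. Still pretty slow. The machine is a 3.3GHz Intel i3 with 16GB of RAM.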

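What else in this loop could be a bottleneck? I pretty much have to have the 3 different dates, but I could get rid of one if it is seriously slowing things down.

The stats are really useful because they tell me how much time is left during these huge integrations. However, I'm guessing that calculating them might slow things down a lot at this scale? Could it be all the printing to the terminal?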

EDIT:

I put the loop, minus the actual insert, into cProfile. This is the result for 99999 sample lines:

         300001 function calls in 32.061 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   32.061   32.061 <string>:1(<module>)
        1   31.326   31.326   32.061   32.061 cprofiler.py:14(myfunction)
   100000    0.396    0.000    0.396    0.000 {built-in method now}
    99999    0.199    0.000    0.199    0.000 {built-in method utcfromtimestamp}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    99999    0.140    0.000    0.140    0.000 {method 'split' of 'str' objects}

EDIT 2: Solved by Matt Tenenbaum - the solution was to output to the terminal only once every 10,000 lines or so.

Answer 1

Delete all indexes that exist on the collection
If you use replication, set w=0, j=0
Use bulk inserts

Some performance tests: http://www.arangodb.org/2012/09/04/bulk-inserts-mongodb-couchdb-arangodb
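As a rough illustration of the bulk-insert suggestion, here is a minimal sketch that groups parsed lines into batches and sends each batch in one call. It assumes Python 3 and PyMongo 3 or later, where collections provide `insert_many`; the helper names `parse_line` and `insert_in_batches` and the batch size of 1000 are made up for this example and are not from the question.

```python
from datetime import datetime, timezone
from itertools import islice


def parse_line(line):
    """Turn one tab-separated line into a document with the question's fields."""
    date, user_id, ts, url, ref_url = line.rstrip("\n").split("\t")[:5]
    return {
        "date": date,         # a date as a string
        "user_id": user_id,   # a string
        "datetime": ts,       # a unix timestamp, kept as a string
        "url": url,
        "ref_url": ref_url,
        "date_obj": datetime.fromtimestamp(float(ts), tz=timezone.utc),
    }


def insert_in_batches(collection, lines, batch_size=1000):
    """Insert documents in batches rather than one call per line.

    `collection` is assumed to be a PyMongo collection (anything with an
    insert_many method works). Returns the number of documents sent.
    """
    docs = (parse_line(line) for line in lines)
    total = 0
    while True:
        batch = list(islice(docs, batch_size))
        if not batch:
            return total
        # ordered=False lets the server keep going past a failing document.
        collection.insert_many(batch, ordered=False)
        total += len(batch)
```

With the question's setup this might be called as `insert_in_batches(Visits, f)`, where `f` is the open file. The best batch size depends on document size and memory, so it is worth measuring a few values.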

Answer 2

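As Zagorulkin said, it's important not to do a bunch of indexing while inserting, so make sure there are no indexes in play.

Beyond that, you might want to limit the feedback to every 1000 lines (or some useful number based on how many lines you need to enter or how much feedback you want). Instead of doing all the computation to produce the feedback on every iteration, change the last block of your code so it is conditional on count, and the test passes only once every 1000 iterations: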
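One way to keep indexes out of play is to drop them before the load and build them once afterwards. The sketch below assumes a PyMongo collection object (its `drop_indexes` and `create_index` methods); the function name `load_without_indexes`, the `load` callback, and the `index_specs` argument are hypothetical and only for illustration.

```python
def load_without_indexes(collection, load, index_specs):
    """Drop secondary indexes, run the load, then rebuild the indexes once.

    collection:  assumed to be a PyMongo collection.
    load:        a zero-argument callable that performs the inserts.
    index_specs: keys to index afterwards, e.g. ["user_id", "date_obj"].
    """
    collection.drop_indexes()           # the default _id index is kept
    load()                              # inserts run with no extra index upkeep
    for keys in index_specs:
        collection.create_index(keys)   # build each index once, after the load
```

Whether rebuilding at the end is a net win depends on the data and the indexes needed, so treat this as a starting point to measure rather than a guarantee.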

    #line integration time/stats
    count = count + 1 #increment the counter
    if count % 1000 == 0:
        now = datetime.now()
        diff = now - startTime
        taken = diff.total_seconds() # unlike .seconds, this includes whole days
        avgPerLine = float(taken) / float(count)
        totalTimeLeft = (howManyLinesTotal-count) * avgPerLine
        print "Time left (mins): " + str(totalTimeLeft/60) #output the stats
        print "Avg per line: " + str(avgPerLine)

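Again, 1000 might not be the right number for you, but something like this will prevent most of that work from happening while still giving you the feedback you're looking for.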

Speeding up pymongo inserts in this loop