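I am writing a Python program to work through some problems with a Wikipedia dump.
However, one thing I have noticed when processing large, disk-heavy datasets is that performance almost always degrades over time.
My computer has a Core i7 at 2.6 GHz, 16 GB of RAM (usage reaches about 5 GB), and a 1 TB 7200 RPM hard drive.
Note: in both cases below, the output is printed at 10-second intervals.
This run uses Redis and Python 2.7: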
[ ] T:251.02 articles/second R: 4628 A: 15474
[ ] T:247.13 articles/second R: 5111 A: 17151
[ ] T:246.41 articles/second R: 5487 A: 19177
[ ] T:258.10 articles/second R: 6200 A: 22217
[ ] T:259.90 articles/second R: 6833 A: 24382
[ ] T:265.22 articles/second R: 7685 A: 26864
[ ] T:274.25 articles/second R: 8981 A: 29488
[ ] T:281.50 articles/second R: 10094 A: 32209
[ ] T:286.51 articles/second R: 11283 A: 34639
[ ] T:296.26 articles/second R: 13033 A: 37414
[ ] T:301.68 articles/second R: 14484 A: 39906
[ ] T:289.22 articles/second R: 14704 A: 40333
[ ] T:277.45 articles/second R: 14940 A: 40634
[ ] T:267.82 articles/second R: 15243 A: 41083
[ ] T:259.04 articles/second R: 15502 A: 41570
[ ] T:250.92 articles/second R: 15778 A: 42014
[ ] T:243.67 articles/second R: 16075 A: 42486
[ ] T:236.79 articles/second R: 16356 A: 42924
[ ] T:230.48 articles/second R: 16649 A: 43358
[ ] T:223.89 articles/second R: 16826 A: 43705
[ ] T:218.44 articles/second R: 17039 A: 44205
[ ] T:213.30 articles/second R: 17234 A: 44705
[ ] T:208.41 articles/second R: 17354 A: 45253
[ ] T:203.60 articles/second R: 17473 A: 45725
[ ] T:199.61 articles/second R: 17627 A: 46329
[ ] T:195.65 articles/second R: 17807 A: 46872
[ ] T:191.64 articles/second R: 17875 A: 47398
[ ] T:188.28 articles/second R: 18003 A: 48008
[ ] T:185.11 articles/second R: 18233 A: 48517
Obviously Redis could be my problem, so here are some results without Redis:
[ ] T:1636.31 articles/second R:3938 A:12949
[ ] T:3716.77 articles/second R:19834 A:61210
[ ] T:2776.43 articles/second R:20213 A:68211
[ ] T:2128.70 articles/second R:20228 A:68867
[ ] T:1729.78 articles/second R:20251 A:69586
[ ] T:1462.91 articles/second R:20289 A:70338
[ ] T:1270.07 articles/second R:20309 A:71107
[ ] T:1124.34 articles/second R:20330 A:71857
[ ] T:1011.18 articles/second R:20376 A:72669
[ ] T:919.88 articles/second R:20391 A:73464
[ ] T:845.36 articles/second R:20406 A:74304
[ ] T:783.06 articles/second R:20417 A:75158
[ ] T:730.05 articles/second R:20427 A:75984
[ ] T:684.37 articles/second R:20436 A:76798
[ ] T:645.07 articles/second R:20451 A:77661
[ ] T:610.67 articles/second R:20475 A:78518
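This is not the "real" performance, since I am not storing the data anywhere (just incrementing the counts of articles and redirects). But we can see the same drop in performance over time.
Is this just the performance when the program first starts, or has it not yet reached a steady state? Since I am not writing log files or anything else, it seems the performance should be relatively constant, because I am continuously reading from the hard drive (granted, it will be seeking around a lot to reach all the files).
I know that putting large amounts of data into a queue is probably bad form, but I thought it would be better to have a single process handle the reads than to hand file paths to the other processes and cause a seek storm. I tried it both ways (putting file paths into the queue, and putting the actual data into the queue), and putting the data into the queue was faster.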
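One detail worth keeping in mind when reading these logs: the printed T value is a cumulative average (total items counted divided by the time since start), so it lags behind the current rate. The following is a minimal sketch for recovering a per-interval rate from consecutive log lines. It assumes the samples are exactly 10 seconds apart, which is only approximately true here, and the function name `interval_rates` is hypothetical. It is written for Python 3.

```python
def interval_rates(samples, interval=10.0):
    """Return the per-interval throughput between consecutive samples.

    samples:  list of (redirects, articles) cumulative counts, one per log line
    interval: assumed spacing between log lines, in seconds
    """
    totals = [redirects + articles for redirects, articles in samples]
    # Difference between neighbouring cumulative totals, divided by the spacing
    return [(cur - prev) / interval for prev, cur in zip(totals, totals[1:])]


# First four R/A pairs from the run without Redis
samples = [(3938, 12949), (19834, 61210), (20213, 68211), (20228, 68867)]
for rate in interval_rates(samples):
    print("%.1f articles/second" % rate)
```

Under that 10-second assumption, the first rows of the run without Redis work out to roughly 6416, 738, and 67 articles/second per interval, so the counted rate drops sharply after the second sample rather than declining gradually.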
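from redis import Redis
import time
import re
from multiprocessing import Process, Queue

r = Redis()
r.flushdb()

# Raw string so the backslashes reach the regex engine unchanged
doubleBrackets = re.compile(r"\[\[(.*?)\]\]")

def findLinks(q, oq):
    # Worker: pull (title, lines) from q and report 0 (redirect) or 1 (article) on oq
    while True:
        # Note: polling q.empty() busy-waits while the queue is empty
        if not q.empty():
            title, lines = q.get()
            links = []
            for line in lines:
                for l in doubleBrackets.findall(line):
                    # Keep only the link target, dropping any display text after '|'
                    l = l.split('|')[0]
                    l = l.strip('|')
                    links.append(l)
                    #r.rpush(title, l)
            # A page with exactly one link is counted as a redirect
            if len(links) == 1:
                oq.put(0)
                #r.incr('Redirects')
            else:
                oq.put(1)
                #r.incr('Articles')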

numArticles = 0
numRedirects = 0
print 'Starting'

# This is a 1 GB file with the paths to all the files I am accessing
linkFile = '/home/andrew/Wikipedia/logFileAll'

q = Queue()
oq = Queue()
processes = []
for i in range(7):
    p = Process(target=findLinks, args=(q,oq))
    processes.append(p)
    p.start()

startTime = time.time()
timer = time.time()
with open(linkFile, 'rb') as f:
    # Iterate over the file so the loop ends at EOF; with a bare
    # "while True: f.readline()" the split below fails on the empty string
    for line in f:
        # The data is formatted so the title and path are separated by a single space
        title, path = line.split(' ')
        with open(path.strip(), 'rb') as fi:
            # Here we read the article
            lines = fi.readlines()
            # We put the title and the article content in the queue
            q.put((title, lines))
        if time.time() - timer > 10:
            # If using Redis
            #print '[ ] T:%.2f articles/second R: %s A: %s' %((int(r.get('Redirects'))+int(r.get('Articles')))/(time.time()-startTime), r.get('Redirects'), r.get('Articles'))
            # Test for redis dependent performance
            while not oq.empty():
                response = oq.get()
                if response:
                    numArticles += 1
                else:
                    numRedirects += 1
            print '[ ] T:%.2f articles/second R:%s A:%s' %((numArticles+numRedirects)/(time.time()-startTime), numRedirects, numArticles)
            timer = time.time()

# When we run through the 1 GB file, we will still have a couple more items to chew through
while True:
    if time.time() - timer > 10:
        print '[ ] R: %s A: %s C:%s' %(r.get('Redirects'), r.get('Articles'), title)
        timer = time.time()
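Edit: Following J.F. Sebastian's comment, I added sentinel values in place of the q.empty() check. It seems some processes were getting stuck somewhere without throwing an exception (strange things happen). Either way, it was a performance boost! Thanks!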
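For reference, the sentinel approach can be sketched roughly as follows. This is not the exact code used for the numbers below. It is a minimal Python 3 illustration in which `SENTINEL`, `find_links`, and `run` are hypothetical names, the worker count is reduced to 2, and the sample pages are made up. Each worker blocks on `q.get()` instead of polling `q.empty()`, and exits when it receives the sentinel.

```python
import re
from multiprocessing import Process, Queue

SENTINEL = None  # hypothetical marker meaning "no more work"
DOUBLE_BRACKETS = re.compile(r"\[\[(.*?)\]\]")


def find_links(q, oq):
    """Block on q.get() until the sentinel arrives, then report completion."""
    # iter(callable, sentinel) keeps calling q.get() until it returns SENTINEL
    for title, lines in iter(q.get, SENTINEL):
        links = [m.split('|')[0]
                 for line in lines
                 for m in DOUBLE_BRACKETS.findall(line)]
        # Same convention as above: 0 = redirect (one link), 1 = article
        oq.put(0 if len(links) == 1 else 1)
    oq.put(SENTINEL)  # tell the parent that this worker has finished


def run(items, num_workers=2):
    """Feed items to the workers and return (redirects, articles)."""
    q, oq = Queue(), Queue()
    workers = [Process(target=find_links, args=(q, oq))
               for _ in range(num_workers)]
    for p in workers:
        p.start()
    for item in items:
        q.put(item)
    for _ in workers:
        q.put(SENTINEL)  # one sentinel per worker
    redirects = articles = finished = 0
    # Drain the output queue before joining so no worker blocks on a full pipe
    while finished < num_workers:
        response = oq.get()
        if response is SENTINEL:
            finished += 1
        elif response:
            articles += 1
        else:
            redirects += 1
    for p in workers:
        p.join()
    return redirects, articles


if __name__ == "__main__":
    sample = [("A", ["#REDIRECT [[B]]"]), ("B", ["[[C]] and [[D|d]]"])]
    print(run(sample))
```

Because `q.get()` blocks, idle workers sleep instead of spinning, and putting one sentinel per worker lets every process shut down cleanly once the input is exhausted.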
[ ] T:250.88 articles/second R:663 A:1850 Proc:7
[ ] T:257.17 articles/second R:1216 A:3940 Proc:7
[ ] T:259.92 articles/second R:1820 A:6000 Proc:7
[ ] T:251.81 articles/second R:2337 A:7762 Proc:7
[ ] T:250.04 articles/second R:2943 A:9590 Proc:7
[ ] T:248.24 articles/second R:3543 A:11389 Proc:7
[ ] T:246.83 articles/second R:4060 A:13260 Proc:7
[ ] T:247.59 articles/second R:4583 A:15271 Proc:7
[ ] T:243.97 articles/second R:5074 A:16938 Proc:7
[ ] T:242.01 articles/second R:5440 A:18819 Proc:7
[ ] T:252.34 articles/second R:6086 A:21741 Proc:7
[ ] T:255.94 articles/second R:6738 A:24053 Proc:7
[ ] T:261.38 articles/second R:7547 A:26518 Proc:7
[ ] T:268.01 articles/second R:8617 A:29000 Proc:7
[ ] T:276.48 articles/second R:9933 A:31648 Proc:7
[ ] T:283.45 articles/second R:11114 A:34358 Proc:7
[ ] T:293.25 articles/second R:12836 A:37148 Proc:7
[ ] T:302.41 articles/second R:14567 A:40015 Proc:7
[ ] T:313.33 articles/second R:16553 A:43147 Proc:7
[ ] T:320.35 articles/second R:17699 A:46551 Proc:7
[ ] T:328.72 articles/second R:18966 A:50261 Proc:7
[ ] T:337.07 articles/second R:19645 A:54724 Proc:7
[ ] T:349.34 articles/second R:19820 A:60768 Proc:7
[ ] T:364.98 articles/second R:20190 A:67674 Proc:7
[ ] T:373.08 articles/second R:20384 A:73183 Proc:7
[ ] T:381.27 articles/second R:20495 A:78957 Proc:7
[ ] T:391.39 articles/second R:20960 A:85070 Proc:7
[ ] T:394.74 articles/second R:22194 A:88710 Proc:7
[ ] T:397.37 articles/second R:23525 A:92105 Proc:7
[ ] T:397.76 articles/second R:24882 A:94855 Proc:7
[ ] T:397.11 articles/second R:26138 A:97387 Proc:7