Optimize this Python log-parsing code

Asked: 2011-05-05 18:46:10

Tags: python multithreading optimization multiprocessing logfiles

The runtime of this code on my laptop, for a 4.2 GB input file, is 48 seconds. The input file is tab-separated, with every value enclosed in quotes. Each record ends with a newline, e.g. '"val1"\t"val2"\t"val3"\t..."valn"\n'

I tried multiprocessing it with ten threads: one to queue up the lines, eight to parse individual lines and fill an output queue, and one to reduce the output queue into the defaultdict shown below. But that code took 300 seconds to run, over six times longer than the following:

from collections import defaultdict
def get_users(log):
    users = defaultdict(int)
    f = open(log)
    # Read header line
    h = f.readline().strip().replace('"', '').split('\t')
    ix_profile = h.index('profile.type')
    ix_user = h.index('profile.id')
    # If either ix_* is the last field in h, it will include a newline. 
    # That's fine for now.
    for (i, line) in enumerate(f): 
        if i % 1000000 == 0: print "Line %d" % i # progress notification

        l = line.split('\t')
        if l[ix_profile] != '"7"': # "7" indicates a bad value
            # use list slicing to remove quotes
            users[l[ix_user][1:-1]] += 1 

    f.close()
    return users

I checked that I'm not I/O bound by removing everything except the print statement from the for loop. That version ran in 9 seconds, which I'll consider a lower bound on how fast this code can run.

I have a lot of these 5 GB files to process, so even a small improvement in runtime would help (and yes, I know I could remove the print!). The machine I'm running on has 4 cores, so I can't help wondering whether there's a way to make multithreaded/multiprocess code run faster than the code above.

Update

I rewrote the multiprocessing code as follows:

from multiprocessing import Pool, cpu_count
from collections import defaultdict

def parse(line, ix_profile=10, ix_user=9):
    """ix_profile and ix_user predetermined; hard-coding for expedience."""
    l = line.split('\t')
    if l[ix_profile] != '"7"':
        return l[ix_user][1:-1]

def get_users_mp():
    f = open('20110201.txt')
    h = f.readline() # remove header line
    pool = Pool(processes=cpu_count())
    result_iter = pool.imap_unordered(parse, f, 100)
    users = defaultdict(int)
    for r in result_iter:
        if r is not None:
            users[r] += 1
    return users

It runs in 26 seconds, a 1.85x speedup. Not bad, but with four cores, not as much as I had hoped for.

7 answers:

Answer 0 (score: 4)

Use a regular expression.

A test determined that the expensive part of the process is the call to str.split(). Probably having to construct a list and a bunch of string objects for every line is expensive.

First, you need to construct a regular expression that matches the line. Something like:

expression = re.compile(r'("[^"]*")\t("[^"]*")\t')

If you call expression.match(line).groups(), you'll get the first two columns extracted as two string objects, and you can apply your logic to them directly.

Now this assumes the two columns of interest are the first two. If they aren't, you just have to adjust the regex to match the correct columns. Your code checks the header to see where the columns are located. You could generate the regex based on that, but I'm guessing the columns really are always in the same place. Just verify they are still there, and use the regex on each line.
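As a minimal, made-up demonstration (assuming, as above, that the two columns of interest are the first two):

```python
import re

# Capture the first two quoted, tab-separated columns of a record.
expression = re.compile(r'("[^"]*")\t("[^"]*")\t')

line = '"12345"\t"7"\t"other stuff"\n'
user, profile = expression.match(line).groups()
# The quotes are still attached to both values, so the bad-value
# comparison would be written as: profile != '"7"'
```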

Edit

from collections import defaultdict
import re

def get_users(log):
    f = open(log)
    # Read header line
    h = f.readline().strip().replace('"', '').split('\t')
    ix_profile = h.index('profile.type')
    ix_user = h.index('profile.id')

    assert ix_user < ix_profile

This code assumes the user column comes before the profile column

    keep_field = r'"([^"]*)"'

This regular expression will capture a single column

    skip_field = r'"[^"]*"'

This regular expression will match a column but not capture the result. (Note the lack of parentheses)

    fields = [skip_field] * len(h)
    fields[ix_profile] = keep_field
    fields[ix_user] = keep_field

Create a list with an entry for every field, keeping only the ones we care about

    del fields[max(ix_profile, ix_user)+1:]

Remove all the fields after the ones we care about (they take time to match, and we don't care about them)

    regex = re.compile(r"\t".join(fields))

Actually produce the regular expression

    users = defaultdict(int)
    for line in f:
        user, profile = regex.match(line).groups()

Pull out the two values and apply the logic

        if profile != "7": # "7" indicates a bad value
            users[user] += 1 

    f.close()
    return users

Answer 1 (score: 2)

If you're running unix or cygwin, the little script below will produce the frequency of user ids where profile != 7. It should be quick.

Updated with awk to count the user ids

#!/bin/bash

FILENAME="test.txt"

IX_PROFILE=`head -1 ${FILENAME} | sed -e 's/\t/\n/g' | nl -w 1 | grep profile.type | cut -f1`
IX_USER=`head -1 ${FILENAME} | sed -e 's/\t/\n/g' | nl -w 1 | grep profile.id | cut -f1`
# Just the userids
# sed 1d ${FILENAME} | cut -f${IX_PROFILE},${IX_USER} | grep -v \"7\" | cut -f2

# userids counted:
# sed 1d ${FILENAME} | cut -f${IX_PROFILE},${IX_USER} | grep -v \"7\" | cut -f2 | sort | uniq -c

# Count using awk..?
sed 1d ${FILENAME} | cut -f${IX_PROFILE},${IX_USER} | grep -v \"7\" | cut -f2 | awk '{ count[$1]++; } END { for (x in count) { print x "\t" count[x] } }'

Answer 2 (score: 1)

Seeing that your log file is tab-separated, you can use the csv module, with a dialect='excel-tab' argument, for a nice performance and readability boost. That is, of course, if you have to use Python instead of the faster console commands.
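A sketch of what that might look like (written here in Python 3 style; the column names and the "7" filter come from the question, while the wiring around csv.reader is an assumption):

```python
import csv
from collections import defaultdict

def get_users_csv(log):
    users = defaultdict(int)
    with open(log) as f:
        reader = csv.reader(f, dialect='excel-tab')
        header = next(reader)              # csv strips the quotes for us
        ix_profile = header.index('profile.type')
        ix_user = header.index('profile.id')
        for row in reader:
            if row[ix_profile] != '7':     # "7" indicates a bad value
                users[row[ix_user]] += 1
    return users
```

Since csv.reader removes the quoting itself, no [1:-1] slicing is needed here.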

Answer 3 (score: 1)

If a regular expression can gain speed by ignoring the tail of the line past the fields that need splitting, then an even simpler approach may help: just cap the number of splits.

[snip]
ix_profile = h.index('profile.type')
ix_user = h.index('profile.id')
maxsplits = max(ix_profile, ix_user) + 1 #### new statement ####
# If either ix_* is the last field in h, it will include a newline. 
# That's fine for now.
for (i, line) in enumerate(f): 
    if i % 1000000 == 0: print "Line %d" % i # progress notification
    l = line.split('\t', maxsplits) #### changed line ####
[snip]

Please time this against your own data.
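The effect of the second argument can be seen on a short made-up line: split stops after maxsplits separators and leaves the rest of the line as one unsplit string:

```python
line = '"a"\t"b"\t"c"\t"d"\t"e"\n'
# With maxsplits == 2, only fields 0 and 1 are individually usable;
# everything past the second tab stays joined in the last element.
parts = line.split('\t', 2)
# parts == ['"a"', '"b"', '"c"\t"d"\t"e"\n']
```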

Answer 4 (score: 0)

Perhaps you could do

users[l[ix_user]] += 1 

instead of

users[l[ix_user][1:-1]] += 1 

and strip the quotes from the keys of the resulting dict at the end. That should save some time.

For the multithreaded approach: try reading a few thousand lines from the file at a time and passing those few thousand lines to a thread to process. Doing it line by line seems like too much overhead.

Or read about the solution in this article, since the author seems to be doing something very similar to what you're trying to do.
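A hypothetical sketch of the chunked idea (the function name, chunk size, and column indices are invented for illustration; the per-line logic is the question's):

```python
from collections import defaultdict
from itertools import islice

def count_users_chunked(f, ix_profile, ix_user, chunk_size=10000):
    """Tally user ids a few thousand lines at a time instead of line by line."""
    users = defaultdict(int)
    while True:
        chunk = list(islice(f, chunk_size))   # grab the next batch of lines
        if not chunk:
            break
        # In a threaded version, this loop body is what each worker would run.
        for line in chunk:
            fields = line.split('\t')
            if fields[ix_profile] != '"7"':   # "7" indicates a bad value
                users[fields[ix_user][1:-1]] += 1
    return users
```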

Answer 5 (score: 0)

This may be slightly beside the point, but Python has some truly weird behavior with multiple threads (and it's especially bad when the threads aren't I/O bound). More specifically, it sometimes runs much slower than single-threaded. This is due to the way the Global Interpreter Lock (GIL) is used in Python to ensure that no more than one thread can execute in the Python interpreter at any given time.

Because of the constraint that only one thread can actually use the interpreter at any given time, the fact that you have multiple cores won't help you. In fact, it may actually make things worse, due to some pathological interactions between two threads competing to acquire the GIL. If you want to stick with Python, you have one of two options:

  1. Try Python 3.2 (or later; 3.0 won't do). It handles the GIL quite differently and fixes the multithreaded slowdown in many cases. I'm assuming you're not on the Python 3 series, since you're using the old print statement.
  2. Use processes instead of threads. Since processes share open file descriptors, you don't really need to pass any state between processes once you've actually started chewing on the file (you could use pipes or messages if you did need to). This increases initial startup time somewhat, since processes take longer to create than threads, but you avoid the GIL problem.
  3. If you want to learn more about this magical quirk of Python, check out the talks related to the GIL on this page: http://www.dabeaz.com/talks.html

Answer 6 (score: 0)

I realize I've done almost exactly the same thing as Winston Ewert: build a regular expression.

But my regex:

  • handles both the ix_profile < ix_user case and the ix_profile > ix_user case

  • captures only the user's column: the profile's column is matched by the subpattern '"(?!7")[^\t\r\n"]*"', which fails to match if "7" is in that column; so we only get the correct users, and only one group is defined
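The negative lookahead can be checked in isolation: a column whose content is exactly 7 is rejected, while anything else matches:

```python
import re

# Subpattern used for the profile column in the regexes below
profile_col = re.compile(r'"(?!7")[^\t\r\n"]*"')

assert profile_col.match('"7"') is None        # bad value: no match
assert profile_col.match('"42"') is not None   # normal value matches
assert profile_col.match('"77"') is not None   # starts with 7 but isn't "7"
```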

Additionally, I tested several matching-and-extracting algorithms:

1) using re.finditer()

2) re.match() with a regex matching all 40 fields

3) re.match() with a regex matching only max(ix_profile, ix_user) + 1 fields

4) like 3, but using a plain dict instead of a defaultdict instance

To measure the times, my code creates a file based on the information you gave about its content.

I tested the following four functions:

1

def get_users_short_1(log):
    users_short = defaultdict(int)
    f = open(log)
    # Read header line
    h = f.readline().strip().replace('"', '').split('\t')
    ix_profile = h.index('profile.type')
    ix_user = h.index('profile.id')
    # If either ix_* is the last field in h, it will include a newline. 
    # That's fine for now.

    glo = 40*['[^\t]*']
    glo[ix_profile] = '"(?!7")[^\t"]+"'
    glo[ix_user] = '"([^\t"]*)"'
    glo[39] = '"[^\t\r\n]*"'
    regx = re.compile('^'+'\t'.join(glo),re.MULTILINE)

    content = f.read()
    for mat in regx.finditer(content):
        users_short[mat.group(1)] += 1

    f.close()
    return users_short

2

def get_users_short_2(log):
    users_short = defaultdict(int)
    f = open(log)
    # Read header line
    h = f.readline().strip().replace('"', '').split('\t')
    ix_profile = h.index('profile.type')
    ix_user = h.index('profile.id')
    # If either ix_* is the last field in h, it will include a newline. 
    # That's fine for now.

    glo = 40*['[^\t]*']
    glo[ix_profile] = '"(?!7")[^\t"]*"'
    glo[ix_user] = '"([^\t"]*)"'
    regx = re.compile('\t'.join(glo))


    for line in f:
        gugu = regx.match(line)
        if gugu:
            users_short[gugu.group(1)] += 1
    f.close()
    return users_short

3

def get_users_short_3(log):
    users_short = defaultdict(int)
    f = open(log)
    # Read header line
    h = f.readline().strip().replace('"', '').split('\t')
    ix_profile = h.index('profile.type')
    ix_user = h.index('profile.id')
    # If either ix_* is the last field in h, it will include a newline. 
    # That's fine for now.

    glo = (max(ix_profile,ix_user) + 1) * ['[^\t]*']
    glo[ix_profile] = '"(?!7")[^\t"]*"'
    glo[ix_user] = '"([^\t"]*)"'
    regx = re.compile('\t'.join(glo))

    for line in f:
        gugu = regx.match(line)
        if gugu:
            users_short[gugu.group(1)] += 1

    f.close()
    return users_short

4

The complete code for 4, which seems to be the fastest:

import re
from random import choice,randint,sample
import csv
import random
from time import clock

choi = 1
if choi:
    ntot = 1000
    chars = 'abcdefghijklmnopqrstuvwxyz0123456789'
    def ry(a=30,b=80,chars=chars,nom='abcdefghijklmnopqrstuvwxyz'):
        if a==30:
            return ''.join(choice(chars) for i in xrange(randint(30,80)))
        else:
            return ''.join(choice(nom) for i in xrange(randint(8,12)))

    num = sample(xrange(1000),200)
    num.sort()
    print 'num==',num
    several = [e//3 for e in xrange(0,800,7) if e//3 not in num]
    print
    print 'several==',several

    with open('biggy.txt','w') as f:
        head = ('aaa','bbb','ccc','ddd','profile.id','fff','ggg','hhhh','profile.type','iiii',
                'jjj','kkkk','lll','mmm','nnn','ooo','ppp','qq','rr','ss',
                'tt','uu','vv','ww','xx','yy','zz','razr','fgh','ty',
                'kfgh','zer','sdfs','fghf','dfdf','zerzre','jkljkl','vbcvb','kljlk','dhhdh')
        f.write('\t'.join(head)+'\n')
        for i in xrange(1000):
            li = [ ry(a=8).join('""') if n==4 else ry().join('""')
                   for n in xrange(40) ]
            if i in num:
                li[4] = '@#~&=*;'
                li[8] = '"7"'
            if i in several:
                li[4] = '"BRAD"'
            f.write('\t'.join(li)+'\n')



from collections import defaultdict
def get_users(log):
    users = defaultdict(int)
    f = open(log)
    # Read header line
    h = f.readline().strip().replace('"', '').split('\t')
    ix_profile = h.index('profile.type')
    ix_user = h.index('profile.id')
    # If either ix_* is the last field in h, it will include a newline. 
    # That's fine for now.
    for (i, line) in enumerate(f): 
        #if i % 1000000 == 0: print "Line %d" % i # progress notification

        l = line.split('\t')
        if l[ix_profile] != '"7"': # "7" indicates a bad value
            # use list slicing to remove quotes

            users[l[ix_user][1:-1]] += 1 
    f.close()
    return users




def get_users_short_4(log):
    users_short = {}
    f = open(log)
    # Read header line
    h = f.readline().strip().replace('"', '').split('\t')
    ix_profile = h.index('profile.type')
    ix_user = h.index('profile.id')
    # If either ix_* is the last field in h, it will include a newline. 
    # That's fine for now.

    glo = (max(ix_profile,ix_user) + 1) * ['[^\t]*']
    glo[ix_profile] = '"(?!7")[^\t"]*"'
    glo[ix_user] = '"([^\t"]*)"'
    regx = re.compile('\t'.join(glo))

    for line in f:
        gugu = regx.match(line)
        if gugu:
            gugugroup = gugu.group(1)
            if gugugroup in users_short:
                users_short[gugugroup] += 1
            else:
                users_short[gugugroup] = 1

    f.close()
    return users_short




print '\n\n'

te = clock()
USERS = get_users('biggy.txt')
t1 = clock()-te

te = clock()
USERS_short_4 = get_users_short_4('biggy.txt')
t2 = clock()-te



if choi:
    print '\nlen(num)==',len(num),' : number of lines with ix_profile==\'"7"\''
    print "USERS['BRAD']==",USERS['BRAD']
    print 'then :'
    print str(ntot)+' lines - '+str(len(num))+' incorrect - '+str(len(several))+\
          ' identical + 1 user BRAD = '+str(ntot - len(num)-len(several)+1)    
print '\nlen(USERS)==',len(USERS)
print 'len(USERS_short_4)==',len(USERS_short_4)
print 'USERS == USERS_short_4 is',USERS == USERS_short_4

print '\n----------------------------------------'
print 'time of get_users() :\n', t1,'\n----------------------------------------'
print 'time of get_users_short_4 :\n', t2,'\n----------------------------------------'
print 'get_users_short_4() / get_users() = '+str(100*t2/t1)+ ' %'
print '----------------------------------------'

One run of this code 4 gives, for example:

num== [2, 12, 16, 23, 26, 33, 38, 40, 43, 45, 51, 53, 84, 89, 93, 106, 116, 117, 123, 131, 132, 135, 136, 138, 146, 148, 152, 157, 164, 168, 173, 176, 179, 189, 191, 193, 195, 199, 200, 208, 216, 222, 224, 227, 233, 242, 244, 245, 247, 248, 251, 255, 256, 261, 262, 266, 276, 278, 291, 296, 298, 305, 307, 308, 310, 312, 314, 320, 324, 327, 335, 337, 340, 343, 350, 356, 362, 370, 375, 379, 382, 385, 387, 409, 413, 415, 419, 433, 441, 443, 444, 446, 459, 462, 474, 489, 492, 496, 505, 509, 511, 512, 518, 523, 541, 546, 548, 550, 552, 558, 565, 566, 572, 585, 586, 593, 595, 601, 609, 610, 615, 628, 632, 634, 638, 642, 645, 646, 651, 654, 657, 660, 662, 665, 670, 671, 680, 682, 687, 688, 690, 692, 695, 703, 708, 716, 717, 728, 729, 735, 739, 741, 742, 765, 769, 772, 778, 790, 792, 797, 801, 808, 815, 825, 828, 831, 839, 849, 858, 859, 862, 864, 872, 874, 890, 899, 904, 906, 913, 916, 920, 923, 928, 941, 946, 947, 953, 955, 958, 959, 961, 971, 975, 976, 979, 981, 985, 989, 990, 999]

several== [0, 4, 7, 9, 11, 14, 18, 21, 25, 28, 30, 32, 35, 37, 39, 42, 44, 46, 49, 56, 58, 60, 63, 65, 67, 70, 72, 74, 77, 79, 81, 86, 88, 91, 95, 98, 100, 102, 105, 107, 109, 112, 114, 119, 121, 126, 128, 130, 133, 137, 140, 142, 144, 147, 149, 151, 154, 156, 158, 161, 163, 165, 170, 172, 175, 177, 182, 184, 186, 196, 198, 203, 205, 207, 210, 212, 214, 217, 219, 221, 226, 228, 231, 235, 238, 240, 249, 252, 254, 259, 263]




len(num)== 200  : number of lines with ix_profile=='"7"'
USERS['BRAD']== 91
then :
1000 lines - 200 incorrect - 91 identical + 1 user BRAD = 710

len(USERS)== 710
len(USERS_short_4)== 710
USERS == USERS_short_4 is True

----------------------------------------
time of get_users() :
0.0788686830309 
----------------------------------------
time of get_users_short_4 :
0.0462885646081 
----------------------------------------
get_users_short_4() / get_users() = 58.690677756 %
----------------------------------------

But the results are more or less variable. I obtained:

get_users_short_1() / get_users() = 82.957476637 %
get_users_short_1() / get_users() = 82.3987686867 %
get_users_short_1() / get_users() = 90.2949842932 %
get_users_short_1() / get_users() = 78.8063007461 %
get_users_short_1() / get_users() = 90.4743181768 %
get_users_short_1() / get_users() = 81.9635560003 %
get_users_short_1() / get_users() = 83.9418269406 %
get_users_short_1() / get_users() = 89.4344442255 %


get_users_short_2() / get_users() = 80.4891442088 %
get_users_short_2() / get_users() = 69.921943776 %
get_users_short_2() / get_users() = 81.8006709304 %
get_users_short_2() / get_users() = 83.6270772928 %
get_users_short_2() / get_users() = 97.9821084403 %
get_users_short_2() / get_users() = 84.9307558629 %
get_users_short_2() / get_users() = 75.9384820018 %
get_users_short_2() / get_users() = 86.2964748485 %


get_users_short_3() / get_users() = 69.4332754744 %
get_users_short_3() / get_users() = 58.5814726668 %
get_users_short_3() / get_users() = 61.8011476831 %
get_users_short_3() / get_users() = 67.6925083362 %
get_users_short_3() / get_users() = 65.1208124156 %
get_users_short_3() / get_users() = 72.2621727569 %
get_users_short_3() / get_users() = 70.6957501222 %
get_users_short_3() / get_users() = 68.5310031226 %
get_users_short_3() / get_users() = 71.6529128259 %
get_users_short_3() / get_users() = 71.6153554073 %
get_users_short_3() / get_users() = 64.7899044975 %
get_users_short_3() / get_users() = 72.947531363 %
get_users_short_3() / get_users() = 65.6691965629 %
get_users_short_3() / get_users() = 61.5194374401 %
get_users_short_3() / get_users() = 61.8396133666 %
get_users_short_3() / get_users() = 71.5447862466 %
get_users_short_3() / get_users() = 74.6710538858 %
get_users_short_3() / get_users() = 72.9651233485 %



get_users_short_4() / get_users() = 65.5224210767 %
get_users_short_4() / get_users() = 65.9023813161 %
get_users_short_4() / get_users() = 62.8055210129 %
get_users_short_4() / get_users() = 64.9690049062 %
get_users_short_4() / get_users() = 61.9050866134 %
get_users_short_4() / get_users() = 65.8127125992 %
get_users_short_4() / get_users() = 66.8112344201 %
get_users_short_4() / get_users() = 57.865635278 %
get_users_short_4() / get_users() = 62.7937713964 %
get_users_short_4() / get_users() = 66.3440149528 %
get_users_short_4() / get_users() = 66.4429530201 %
get_users_short_4() / get_users() = 66.8692388625 %
get_users_short_4() / get_users() = 66.5949137537 %
get_users_short_4() / get_users() = 69.1708488794 %
get_users_short_4() / get_users() = 59.7129743801 %
get_users_short_4() / get_users() = 59.755297387 %
get_users_short_4() / get_users() = 60.6436352185 %
get_users_short_4() / get_users() = 64.5023727945 %
get_users_short_4() / get_users() = 64.0153937511 %

I wonder what results the code would give on your real file, with a computer more powerful than mine. Please let me know.

编辑1

使用

def get_users_short_Machin(log):
    users_short = defaultdict(int)
    f = open(log)
    # Read header line
    h = f.readline().strip().replace('"', '').split('\t')
    ix_profile = h.index('profile.type')
    ix_user = h.index('profile.id')
    maxsplits = max(ix_profile, ix_user) + 1
    # If either ix_* is the last field in h, it will include a newline. 
    # That's fine for now.
    for line in f: 
        #if i % 1000000 == 0: print "Line %d" % i # progress notification
        l = line.split('\t', maxsplits)
        if l[ix_profile] != '"7"': # "7" indicates a bad value
            # use list slicing to remove quotes
            users_short[l[ix_user][1:-1]] += 1 
    f.close()
    return users_short

I get

get_users_short_Machin() / get_users() = 60.6771821308 %
get_users_short_Machin() / get_users() = 71.9300992989 %
get_users_short_Machin() / get_users() = 85.1695214715 %
get_users_short_Machin() / get_users() = 72.7722233685 %
get_users_short_Machin() / get_users() = 73.6311173237 %
get_users_short_Machin() / get_users() = 86.0848484053 %
get_users_short_Machin() / get_users() = 75.1661981729 %
get_users_short_Machin() / get_users() = 72.8888452474 %
get_users_short_Machin() / get_users() = 76.7185685993 %
get_users_short_Machin() / get_users() = 82.7007096958 %
get_users_short_Machin() / get_users() = 71.1678957888 %
get_users_short_Machin() / get_users() = 71.9845835126 %

Using a plain dict:

users_short = {}
.......
for line in f: 
    #if i % 1000000 == 0: print "Line %d" % i # progress notification
    l = line.split('\t', maxsplits)
    if l[ix_profile] != '"7"': # "7" indicates a bad value
        # use list slicing to remove quotes
        us = l[ix_user][1:-1]
        if us not in users_short:
            users_short[us] = 1
        else:
            users_short[us] += 1

improves the execution time slightly, but it is still higher than that of my latest code 4:

get_users_short_Machin2() / get_users() = 71.5959919389 %
get_users_short_Machin2() / get_users() = 71.6118864535 %
get_users_short_Machin2() / get_users() = 66.3832514274 %
get_users_short_Machin2() / get_users() = 68.0026407277 %
get_users_short_Machin2() / get_users() = 67.9853921552 %
get_users_short_Machin2() / get_users() = 69.8946203037 %
get_users_short_Machin2() / get_users() = 71.8260030248 %
get_users_short_Machin2() / get_users() = 78.4243267003 %
get_users_short_Machin2() / get_users() = 65.7223734428 %
get_users_short_Machin2() / get_users() = 69.5903935612 %

编辑2

最快的:

def get_users_short_CSV(log):
    users_short = {}
    f = open(log,'rb')
    rid = csv.reader(f,delimiter='\t')
    # Read header line
    h = rid.next()
    ix_profile = h.index('profile.type')
    ix_user = h.index('profile.id')
    # If either ix_* is the last field in h, it will include a newline. 
    # That's fine for now.

    glo = (max(ix_profile,ix_user) + 1) * ['[^\t]*']
    glo[ix_profile] = '"(?!7")[^\t\r\n"]*"'
    glo[ix_user] = '"([^\t\r\n"]*)"'
    regx = re.compile('\t'.join(glo))

    for line in f:
        gugu = regx.match(line)
        if gugu:
            gugugroup = gugu.group(1)
            if gugugroup in users_short:
                users_short[gugugroup] += 1
            else:
                users_short[gugugroup] = 1

    f.close()
    return users_short

Results:

get_users_short_CSV() / get_users() = 31.6443901114 %
get_users_short_CSV() / get_users() = 44.3536176134 %
get_users_short_CSV() / get_users() = 47.2295100511 %
get_users_short_CSV() / get_users() = 45.4912200716 %
get_users_short_CSV() / get_users() = 63.7997241038 %
get_users_short_CSV() / get_users() = 43.5020255488 %
get_users_short_CSV() / get_users() = 40.9188320386 %
get_users_short_CSV() / get_users() = 43.3105062139 %
get_users_short_CSV() / get_users() = 59.9184895288 %
get_users_short_CSV() / get_users() = 40.22047881 %
get_users_short_CSV() / get_users() = 48.3615872543 %
get_users_short_CSV() / get_users() = 47.0374831251 %
get_users_short_CSV() / get_users() = 44.5268626789 %
get_users_short_CSV() / get_users() = 53.1690205938 %
get_users_short_CSV() / get_users() = 43.4022458372 %

编辑3

我测试了 get_users_short_CSV(),文件中包含10000行而不是1000行:

len(num)== 2000  : number of lines with ix_profile=='"7"'
USERS['BRAD']== 95
then :
10000 lines - 2000 incorrect - 95 identical + 1 user BRAD = 7906

len(USERS)== 7906
len(USERS_short_CSV)== 7906
USERS == USERS_short_CSV is True

----------------------------------------
time of get_users() :
0.794919186656 
----------------------------------------
time of get_users_short_CSV :
0.358942826532 
----------------------------------------
get_users_short_CSV() / get_users() = 41.5618307521 %

get_users_short_CSV() / get_users() = 42.2769300584 %
get_users_short_CSV() / get_users() = 45.154631132 %
get_users_short_CSV() / get_users() = 44.1596819482 %
get_users_short_CSV() / get_users() = 30.3192350266 %
get_users_short_CSV() / get_users() = 34.4856637748 %
get_users_short_CSV() / get_users() = 43.7461535628 %
get_users_short_CSV() / get_users() = 41.7577246935 %
get_users_short_CSV() / get_users() = 41.9092878608 %
get_users_short_CSV() / get_users() = 44.6772360665 %
get_users_short_CSV() / get_users() = 42.6770989413 %