Massive slowdown using string.count() when reading data containing certain characters

Date: 2016-11-09 10:20:45

Tags: python string performance

I need to read in a large csv data file, but the file is littered with newline characters and is generally quite messy. So rather than cleaning it up by hand, I read it in and repair it programmatically, but I ran into a strange slowdown that seems to depend on which characters occur in the file.

While trying to reproduce the problem by creating similar-looking csv files with random data, I came to think that the problem might be in the count function.

Consider this example: it creates a large file of messy random data, reads the file back in, and then uses the count command so that it can be read as columnar data.

Note that in the first run I use only string.ascii_letters for the random data, while in the second run I use characters from string.printable.

import os
import random as rd
import string
import time

# Function to create random data in a specific pattern with separator ";":
def createRandomString(num,io,fullLength):
    lineFull = ''
    nl = True
    randstr = ''.join(rd.choice(string.ascii_letters) for _ in range(7))
    for i in range(num):
        if i == 0:
            line = 'Start;'
        else:
            line = ''
            bb = rd.choice([True,True,False])
            if bb:
                line = line+'\"\";'
            else:
                if rd.random() < 0.999:
                    line = line+randstr
                else:
                    line = line+rd.randint(10,100)*randstr
                if nl and i != num-1:
                    line = line+';\n'
                    nl = False
                elif rd.random() < 0.04 and i != num-1:
                    line = line+';\n'
                    if rd.random() < 0.01:
                        add = rd.randint(1,10)*'\n'
                        line = line+add
                else:
                    line = line+';'
        lineFull = lineFull+line
    return lineFull+'\n'

# Create file with random data:
outputFolder = "C:\\DataDir\\Output\\"
numberOfCols = 38
fullLength = 10000
testLines = [createRandomString(numberOfCols,i,fullLength) for i in range(fullLength)]
with open(outputFolder+"TestFile.txt",'w') as tf:
    tf.writelines(testLines)

# Read in file:
with open(outputFolder+"TestFile.txt",'r') as ff:
    lines = []
    for line in ff.readlines():
        lines.append(unicode(line.rstrip('\n')))

# Restore columns by counting the separator:
linesT = ''
lines2 = []
time0 = time.time()
for i in range(len(lines)):
    linesT = linesT + lines[i]
    count = linesT.count(';')
    if count == numberOfCols:
        lines2.append(linesT)
        linesT = ''
    if i%1000 == 0:
        print time.time()-time0
        time0 = time.time()
print time.time()-time0

The print statements output:

0.0
0.0019998550415
0.00100016593933
0.000999927520752
0.000999927520752
0.000999927520752
0.000999927520752
0.00100016593933
0.0019998550415
0.000999927520752
0.00100016593933
0.0019998550415
0.00100016593933
0.000999927520752
0.00200009346008
0.000999927520752
0.000999927520752
0.00200009346008
0.000999927520752
0.000999927520752
0.00200009346008
0.000999927520752
0.00100016593933
0.000999927520752
0.00200009346008
0.000999927520752

Consistently fast performance.

Now if I change the third line of createRandomString to randstr = ''.join(rd.choice(string.printable) for _ in range(7)), my output becomes:

0.0
0.0759999752045
0.273000001907
0.519999980927
0.716000080109
0.919999837875
1.11500000954
1.25199985504
1.51200008392
1.72199988365
1.8820002079
2.07999992371
2.21499991417
2.37400007248
2.64800000191
2.81900000572
3.04500007629
3.20299983025
3.55500006676
3.6930000782
3.79499983788
4.13900017738
4.19899988174
4.58700013161
4.81799983978
4.92000007629
5.2009999752
5.40199995041
5.48399996758
5.70299983025
5.92300009727
6.01099991798
6.44200015068
6.58999991417
3.99399995804

Not only is the performance very slow, it also keeps getting slower over time.

The only difference lies in the range of characters written into the random data.

The complete set of characters occurring in my real data is:

charSet = [' ','"','&',"'",'(',')','*','+',',','-','.','/','0','1','2','3','4','5','6',
           '7','8','9',':',';','<','=','>','A','B','C','D','E','F','G','H','I','J','K',
           'L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','\\','_','`','a',
           'b','d','e','g','h','i','l','m','n','o','r','s','t','x']

Let's run some benchmarks on the count function:

import string
import random as rd
rd.seed()

def Test0():
    randstr = ''.join(rd.choice(string.digits) for _ in range(10000))
    randstr.count('7')

def Test1():
    randstr = ''.join(rd.choice(string.ascii_letters) for _ in range(10000))
    randstr.count('a')

def Test2():
    randstr = ''.join(rd.choice(string.printable) for _ in range(10000))
    randstr.count(';')

def Test3():
    randstr = ''.join(rd.choice(charSet) for _ in range(10000))
    randstr.count(';')

I test digits only, letters only, string.printable, and the character set from my data.

The %timeit results:

%timeit(Test0())
100 loops, best of 3: 9.27 ms per loop
%timeit(Test1())
100 loops, best of 3: 9.12 ms per loop
%timeit(Test2())
100 loops, best of 3: 9.94 ms per loop
%timeit(Test3())
100 loops, best of 3: 8.31 ms per loop

Performance is consistent and does not indicate any problem of count with particular character sets.

I also tested whether concatenating the strings with + causes the slowdown, but that is not the case.

Can anyone explain this or give me a hint?

EDIT: Using Python 2.7.12

EDIT 2: In my original data the following happens:

The file has about 550000 rows, which are frequently broken apart by random newline characters but are still delimited by the 38 ";" separators. Performance is fast up to around row 300000, and beyond that point it suddenly starts getting slower and slower. I will now investigate this further with these new leads.

2 Answers:

Answer 0 (score: 2):

The problem lies in count(';').

string.printable contains ';', whereas string.ascii_letters does not.
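A quick check with the standard string module confirms this:

```python
import string

# string.printable includes digits, letters, punctuation and whitespace,
# so it contains the separator ';'; string.ascii_letters does not.
print(';' in string.printable)      # True
print(';' in string.ascii_letters)  # False
```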

As linesT then grows in length, the execution time grows with it:

0.000236988067627
0.0460968017578
0.145275115967
0.271568059921
0.435608148575
0.575787067413
0.750104904175
0.899538993835
1.08505797386
1.24447107315
1.34459710121
1.45430088043
1.63317894936
1.90502595901
1.92841100693
2.07722711563
2.16924905777
2.30753016472
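This growth is what the cost of str.count predicts: each call scans the whole string, so a single count is linear in len(linesT), and re-counting an ever-growing buffer on every iteration makes the total work quadratic. A small sketch of the per-call cost (the buffer and repeat counts here are arbitrary, and the timings are machine-dependent):

```python
import time

chunk = 'abcdefg;' * 1000            # ~8 KB of data with 1000 separators
buf = ''
for step in range(1, 6):
    buf = buf + chunk                # the buffer keeps growing, like linesT
    t0 = time.time()
    for _ in range(200):
        n = buf.count(';')           # scans the ENTIRE buffer on every call
    print('len=%7d  count=%5d  200 calls: %.4fs'
          % (len(buf), n, time.time() - t0))
```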

(Image: faulty runtime behaviour)

In particular, this code is the problem with string.printable:
 numberOfCols = 38
 if count == numberOfCols:
        lines2.append(linesT)
        linesT = ''

Because a fragment can contain several ';' characters at once, count can jump past 38 before linesT is flushed. The check count == numberOfCols then never fires, the flush is skipped, and linesT grows without bound.
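A tiny sketch of that overshoot (the fragment strings are hypothetical; the point is that the first piece adds two ';' at once, so the count jumps from 37 straight past 38):

```python
numberOfCols = 38
linesT = 'x;' * 37                        # 37 separators accumulated so far
fragments = ['abc;de;f', 'ghi', 'j;k']    # hypothetical pieces of one messy row

counts = []
for frag in fragments:
    linesT = linesT + frag
    counts.append(linesT.count(';'))

print(counts)   # [39, 39, 40] -- count never equals 38, so no flush ever happens
assert numberOfCols not in counts
```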

You can observe this behaviour with the initial string.ascii_letters setup as well, by changing count(';') to count('a') and the test to if count > numberOfCols:.

To solve the string.printable problem you can modify the code accordingly, after which the per-1000-line timings become:
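The answer's modified code block appears to be missing from this copy of the page. Based on the explanation above, the fix is presumably to flush as soon as the separator count reaches or exceeds numberOfCols, so that an overshoot can no longer skip the reset. A minimal sketch of that change:

```python
def restore_rows(pieces, numberOfCols=38):
    # Flush as soon as the separator count reaches OR EXCEEDS the target:
    # '>=' instead of the original '==', so an overshoot cannot skip the reset.
    rows, buf = [], ''
    for piece in pieces:
        buf = buf + piece
        if buf.count(';') >= numberOfCols:
            rows.append(buf)
            buf = ''
    return rows

# With '==' the second row would never flush (its count jumps 37 -> 39);
# with '>=' both calls recover one row and the buffer stays small.
print(len(restore_rows(['a;' * 38])))          # 1
print(len(restore_rows(['a;' * 37, 'b;;c'])))  # 1
```

Note that a row flushed after an overshoot still contains the extra ';' characters, so it may need separate cleanup; the '>=' only fixes the runaway buffer growth.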

0.000234842300415
0.00233697891235
0.00247097015381
0.00217199325562
0.00262403488159
0.00262403488159
0.0023078918457
0.0024049282074
0.00231409072876
0.00233006477356
0.00214791297913
0.0028760433197
0.00241804122925
0.00250506401062
0.00254893302917
0.00266218185425
0.00236296653748
0.00201988220215
0.00245118141174
0.00206398963928
0.00219988822937
0.00230193138123
0.00205302238464
0.00230097770691
0.00248003005981
0.00204801559448

And then we are back to the expected runtime behaviour.


Answer 1 (score: 1):

I am just reporting what I found. The performance difference does not seem to come from the str.count() function. I changed your code and refactored str.count() into a function of its own. I also moved your global code into a main function. Here is my version of the code:

import os
import time
import random as rd
import string
import timeit

# Function to create random data in a specific pattern with separator ";":
def createRandomString(num,io,fullLength):
    lineFull = ''
    nl = True
    randstr = ''.join(rd.choice(string.ascii_letters) for _ in range(7))
    #randstr = ''.join(rd.choice(string.printable) for _ in range(7))
    for i in range(num):
        if i == 0:
            line = 'Start;'
        else:
            line = ''
            bb = rd.choice([True,True,False])
            if bb:
                line = line+'\"\";'
            else:
                if rd.random() < 0.999:
                    line = line+randstr
                else:
                    line = line+rd.randint(10,100)*randstr
                if nl and i != num-1:
                    line = line+';\n'
                    nl = False
                elif rd.random() < 0.04 and i != num-1:
                    line = line+';\n'
                    if rd.random() < 0.01:
                        add = rd.randint(1,10)*'\n'
                        line = line+add
                else:
                    line = line+';'
        lineFull = lineFull+line
    return lineFull+'\n'


def counting_func(lines_iter):
    try:
        return lines_iter.next().count(';')
    except StopIteration:
        return -1


def wrapper(func, *args, **kwargs):
    def wrapped():
        return func(*args, **kwargs)
    return wrapped


# Create file with random data:
def main():
    fullLength = 100000
    outputFolder = ""
    numberOfCols = 38
    testLines = [createRandomString(numberOfCols,i,fullLength) for i in range(fullLength)]
    with open(outputFolder+"TestFile.txt",'w') as tf:
        tf.writelines(testLines)

    # Read in file:
    with open(outputFolder+"TestFile.txt",'r') as ff:
        lines = []
        for line in ff.readlines():
            lines.append(unicode(line.rstrip('\n')))

    # Restore columns by counting the separator:
    lines_iter = iter(lines)
    print timeit.timeit(wrapper(counting_func, lines_iter), number=fullLength)


if __name__ == '__main__': main()

This runs the test 100000 times, once on each generated line. With string.ascii_letters I get an average of 0.0454177856445 seconds per loop from timeit. With string.printable I get an average of 0.0426299571991. The latter is in fact slightly faster than the former, but not a really significant difference.

I suspect the performance difference comes from what you do in the following loop besides the counting:

for i in range(len(lines)):
    linesT = linesT + lines[i]
    count = linesT.count(';')
    if count == numberOfCols:
        lines2.append(linesT)
        linesT = ''
    if i%1000 == 0:
        print time.time()-time0
        time0 = time.time()

Another possibility would be that accessing global variables without a main function slows things down. But that would happen in both cases, so it is not really the explanation.