Collecting data in Python first in order to operate on it

Time: 2016-12-28 07:24:12

Tags: python python-performance

Recently I took a test and ran into the following problem: I had to match logdata to expected_result. The code is below, edited with my solution:

import collections


log_data = """1.1.2014 12:01,111-222-333,454-333-222,COMPLETED
1.1.2014 13:01,111-222-333,111-333,FAILED
1.1.2014 13:04,111-222-333,454-333-222,FAILED
1.1.2014 13:05,111-222-333,454-333-222,COMPLETED
2.1.2014 13:01,111-333,111-222-333,FAILED
"""

expected_result = {
    "111-222-333": "40.00%",
    "454-333-222": "66.67%",
    "111-333" : "0.00%"
}

def compute_success_ratio(logdata):
    #! better option to use .splitlines()
    #! or even better recognize the CSV structure and use csv.reader
    entries = logdata.split('\n')
    #! interesting choice to collect the data first
    #! which could result in explosive growth of memory hunger, are there
    #! alternatives to this structure?
    complst = []
    faillst = []
    #! probably no need for attaching `lst` to the variable name, no? 

    for entry in entries:
        #! variable naming could be clearer here
        #! a good way might involve destructuring the entry like:
        #! _, caller, callee, result
        #! which also avoids using magic indices further down (-1, 1, 2)
        ent = entry.split(',')
        if ent[-1] == 'COMPLETED':
            #! complst.extend(ent[1:3]) for even more brevity
            complst.append(ent[1])
            complst.append(ent[2])
        elif ent[-1] == 'FAILED':
            faillst.append(ent[1])
            faillst.append(ent[2])

    #! variable postfix `lst` could let us falsely assume that the result of set()
    #! is a list.
    numlst = set(complst + faillst)

    #! good use of collections.Counter,
    #! but: Counter() already is a dictionary, there is no need to convert it to one
    comps = dict(collections.Counter(complst))
    fails = dict(collections.Counter(faillst))
    #! variable naming overlaps with global, and doesn't make sense in this context
    expected_result = {}

    for e in numlst:
        #! good: dealt with possibility of a number not showing up in `comps` or `fails`
        #! bad: using a try/except block to deal with this when a simpler .get(e, 0)
        #! would've allowed dealing with this more elegantly
        try:
            #! variable naming not very expressive
            rat = float(comps[e]) / float(comps[e] + fails[e]) * 100
            perc = round(rat, 2)
            #! here we are rounding twice, and then don't use the formatting string
            #! to attach the % -- '{:.2f}%'.format(perc) would've been the right
            #! way if one doesn't know percentage formatting (see below)
            expected_result[e] = "{:.2f}".format(perc) + '%'
            #! a generally better way would be to either
            #! from __future__ import division
            #! or to compute the ratio as 
            #! ratio = float(comps[e]) / (comps[e] + fails[e])
            #! and then use percentage formatting for the ratio
            #! "{:.2%}".format(ratio) 
        except KeyError:
            expected_result[e] = '0.00%'

    return expected_result

if __name__ == "__main__":
    assert(compute_success_ratio(log_data) == expected_result)

#! overall
#! + correct 
#! ~ implementation not optimal, relatively wasteful in terms of memory 
#! - variable naming inconsistent, overly shortened, not expressive
#! - some redundant operations
#! + good use of standard library collections.Counter
#! ~ code could be a tad bit more idiomatic

I understand some of the issues, such as the variable naming conventions and avoiding the try/except block where possible. However, I cannot work out how to improve the code using csv.reader. Also, how should I understand the comment about collecting the data first? What are the alternatives? Could someone shed some light on these two questions?

3 Answers:

Answer 0 (score: 2):

When you do entries = logdata.split('\n'), you create a list containing all the split strings. Since log files can be very large, this consumes a lot of memory.

csv.reader, on the other hand, reads the file one line at a time (roughly speaking). That means the data stays in the file and only one line is held in memory at any moment.
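For illustration, a minimal sketch of reading a log row by row with csv.reader (the file name 'logfile.csv' is hypothetical):

import csv

with open('logfile.csv', newline='') as f:
    for row in csv.reader(f):
        # row arrives already split into columns, e.g.
        # ['1.1.2014 12:01', '111-222-333', '454-333-222', 'COMPLETED']
        print(row)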

Forgetting about CSV parsing for a minute, the difference between these two approaches illustrates the problem:

In approach 1, we read the whole file into memory:

data = open('logfile').read().split('\n')
for line in data:
   # do something with the line

In approach 2, we read one line at a time:

data = open('logfile')
for line in data:
    # do something with the line

Approach 1 consumes far more memory because the entire file has to be read into memory. It also traverses the data twice: once when reading it and once when splitting it into lines. The drawback of approach 2 is that we can only loop over data once.
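To illustrate the single-pass limitation, a small sketch (assuming a file named 'logfile' exists):

f = open('logfile')
for line in f:
    pass         # the first pass consumes the file iterator
for line in f:
    print(line)  # never runs: the iterator is already exhausted
f.seek(0)        # rewinding the file lets you iterate over it again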

For the specific case here, where we are reading not from a file but from a variable that is already in memory, the biggest difference is that the split approach will consume roughly twice as much memory.

Answer 1 (score: 1):

Both split('\n') and splitlines create a copy of the data in which every line is a separate item in a list. Since you only need to pass over the data once, rather than access lines randomly, this is wasteful compared to a CSV reader. The other benefit of using the reader is that you don't have to split the data into lines, and the lines into columns, by hand.
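For instance, a minimal sketch (using the log_data string from the question) showing that each row comes back from the reader already split into columns:

import csv
from io import StringIO

reader = csv.reader(StringIO(log_data))  # log_data as defined in the question
first_row = next(reader)
# first_row == ['1.1.2014 12:01', '111-222-333', '454-333-222', 'COMPLETED']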

The comment about collecting the data refers to the fact that you add all the completed and failed items to two lists. Say the item 111-333 was completed five times and failed twice; your data would look like this:

complst = ['111-333', '111-333', '111-333', '111-333', '111-333']
faillst = ['111-333', '111-333']

You don't need those duplicated items, so you could use Counter directly, without collecting the items into lists first, and save a lot of memory.
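A minimal sketch of that idea (again using log_data from the question): update two Counters while streaming over the rows, so no intermediate lists are built:

import collections
import csv
from io import StringIO

completed = collections.Counter()
failed = collections.Counter()
for row in csv.reader(StringIO(log_data)):
    if row[-1] == 'COMPLETED':
        completed.update(row[1:3])  # count caller and callee directly
    elif row[-1] == 'FAILED':
        failed.update(row[1:3])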

Here is an alternative implementation using csv.reader that collects the success and failure counts into a dict, where the item name is the key and the value is the list [success count, failure count]:

from collections import defaultdict
import csv
from io import StringIO

log_data = """1.1.2014 12:01,111-222-333,454-333-222,COMPLETED
1.1.2014 13:01,111-222-333,111-333,FAILED
1.1.2014 13:04,111-222-333,454-333-222,FAILED
1.1.2014 13:05,111-222-333,454-333-222,COMPLETED
2.1.2014 13:01,111-333,111-222-333,FAILED
"""

RESULT_STRINGS = ['COMPLETED', 'FAILED']
counts = defaultdict(lambda: [0, 0])
for _, *params, result in csv.reader(StringIO(log_data)):
    try:
        index = RESULT_STRINGS.index(result)
        for param in params:
            counts[param][index] += 1
    except ValueError:
        pass # Skip line in case last column is not in RESULT_STRINGS

result = {k: '{0:.2f}%'.format(v[0] / sum(v) * 100) for k, v in counts.items()} 

Note that the above works only on Python 3 (it relies on extended iterable unpacking in the for loop and on true division).

Answer 2 (score: 0):

Alternatively, Pandas looks like a good solution here, if you are allowed to use it.

import pandas as pd

log_data = pd.read_csv('data.csv',header=None)
log_data.columns = ['date', 'key1','key2','outcome']

meltedData = pd.melt(log_data, id_vars=['date','outcome'], value_vars=['key1','key2'],
              value_name = 'key') # we transpose the keys here
meltedData['result'] = [int(x.lower() == 'completed') for x in meltedData['outcome']] # add summary variable

groupedData = meltedData.groupby(['key'])['result'].mean()
groupedDict = groupedData.to_dict()

print(groupedDict)

Result:

{'111-333': 0.0, '111-222-333': 0.40000000000000002, '454-333-222': 0.66666666666666663}
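
If you need the exact string format from expected_result, a small follow-up sketch: apply percentage formatting to the fractions:

formatted = {k: '{:.2%}'.format(v) for k, v in groupedDict.items()}
# {'111-333': '0.00%', '111-222-333': '40.00%', '454-333-222': '66.67%'}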