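Averaging the differences between multiple values in a dictionary

Time: 2015-12-29 18:09:53

Tags: python

I have a tab-delimited text file with two columns, Bill to Name and Date, where Date is in Excel's numeric date format. The code: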

import csv
from collections import defaultdict

d = defaultdict( list )

input_file = "C:\\Users\\Intern\\Documents\\Python.txt"
output_file = "C:\\Users\\Intern\\Documents\\b.csv"

with open( input_file, 'r') as infile:
    reader = csv.reader(infile, delimiter='\t')
    next(reader, None)  # skip the header
    for row in reader:
        d[ row[0] ].append( int(row[1]) )

with open( output_file, 'w' ) as outfile:
    writer = csv.writer(outfile, delimiter='\t')
    for key, value in d.items():
        if len(value) == 1:
            avg_diff = None # or 0 -- this indicates there was only 1 purchase
        else:
            # This requires your dates to be sorted, ascending, but that just takes
            # wrapping 'value' in 'sorted' if it isn't sorted yet
            avg_diff = mean([v[i] - v[i-1] for i, v in enumerate(value) if i])
        writer.writerow( [key, avg_diff] )

Current error:

TypeError                                 Traceback (most recent call last)
<ipython-input-2-1e819db94549> in <module>()
     22 # This requires your dates to be sorted, ascending, but that just takes
     23 # wrapping 'value' in 'sorted' if it isn't sorted yet
---> 24 avg_diff = mean([v[i] - v[i-1] for i, v in enumerate(value) if i])
     25 writer.writerow( [key, avg_diff] )

<ipython-input-2-1e819db94549> in <listcomp>(.0)
     22 # This requires your dates to be sorted, ascending, but that just takes
     23 # wrapping 'value' in 'sorted' if it isn't sorted yet
---> 24 avg_diff = mean([v[i] - v[i-1] for i, v in enumerate(value) if i])
     25 writer.writerow( [key, avg_diff] )

TypeError: 'float' object is not subscriptable

This is the updated code I'm currently stuck on.
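For reference, a minimal sketch of what the traceback is pointing at, assuming `value` is a plain list of numbers (the sample values below are made up for illustration). `enumerate(value)` yields `(index, element)` pairs, so `v` is a single number, and `v[i]` tries to index into that number rather than into the list.

```python
value = [41529.0, 41852.0, 42244.0]  # hypothetical dates for one customer

# enumerate() yields (index, element): v is one float, not the whole list
for i, v in enumerate(value):
    assert isinstance(v, float)

# Indexing the element instead of the list raises the TypeError
try:
    [v[i] - v[i - 1] for i, v in enumerate(value) if i]
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Indexing the list itself gives the consecutive differences
diffs = [value[i] - value[i - 1] for i in range(1, len(value))]
print(error_message)
print(diffs)
```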

3 answers:

Answer 0 (score: 1)

It looks like you just need a simple function to compute the average.

def avg(iterable):
    count = 0
    running_sum = 0
    for item in iterable:
        running_sum += item
        count += 1
    return running_sum / float(count)

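Now you just need the values. If I understand your intent, you want the value at i minus the value at i - 1...

The itertools documentation has a recipe that does almost exactly this, but if you would rather not use itertools, it shouldn't be hard to write yourself: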

from itertools import tee

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    # built-in zip is lazy on Python 3; itertools.izip exists only on Python 2
    return zip(a, b)
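As a rough sketch of the do-it-yourself route, here is one way to get the same pairs without itertools. The name `pairwise_plain` is made up for illustration; it just remembers the previous item while walking the iterable once.

```python
def pairwise_plain(iterable):
    """Yield (s0, s1), (s1, s2), ... without using itertools."""
    it = iter(iterable)
    try:
        prev = next(it)  # first item; nothing to pair it with yet
    except StopIteration:
        return  # empty input: yield nothing
    for cur in it:
        yield prev, cur
        prev = cur


print(list(pairwise_plain([41529, 41852, 42244])))
```

An input with fewer than two items produces no pairs at all, which is the same behaviour as the tee-based version.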

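That doesn't give us the differences yet, but a generator expression produces them, and we can pass it straight to our avg function (since we were careful to make avg work with any iterable, not just sequences):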

average = avg(n - p for p, n in pairwise(values))
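Putting the pieces of this answer together, a self-contained sketch might look like the following. It assumes Python 3 (so the built-in `zip` is used and `/` is true division), and the sample `values` are the three Adam Adamson dates from the sample input further down the thread.

```python
from itertools import tee


def pairwise(iterable):
    """s -> (s0, s1), (s1, s2), (s2, s3), ..."""
    a, b = tee(iterable)
    next(b, None)  # advance the second copy by one item
    return zip(a, b)


def avg(iterable):
    """Average of any iterable, consumed in a single pass."""
    count = 0
    running_sum = 0
    for item in iterable:
        running_sum += item
        count += 1
    return running_sum / count  # true division on Python 3


values = sorted([42244, 41529, 41852])  # dates must be in ascending order
average = avg(n - p for p, n in pairwise(values))
print(average)  # 357.5
```

With a single date there are no pairs, so `avg` would divide by zero; the caller needs to handle that case separately.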

Answer 1 (score: 1)

Instead of max(value) - min(value), it seems (if I understand correctly) that you could write:

def mean(x):
    return float(sum(x))/len(x)

...
for key, value in d.items():
    if len(value) == 1:
        avg_diff = None # or 0 -- this indicates there was only 1 purchase
    else:
        # This requires your dates to be sorted, ascending
        sv = sorted(value)
        avg_diff = mean([sv[i] - sv[i-1] for i in range(len(sv)) if i])
    writer.writerow( [key, avg_diff] )

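This will give you the average gap between dates for each person.

I think None is better for a single purchase, since 0 is a valid value when someone buys two things on the same day.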
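The None-versus-0 distinction can be sketched as a small helper. The name `average_gap` is hypothetical, and it uses `statistics.mean` from the standard library (Python 3.4+) in place of the hand-written `mean` above.

```python
from statistics import mean  # standard library, Python 3.4+


def average_gap(dates):
    """Mean difference between consecutive dates, or None for fewer than two."""
    if len(dates) < 2:
        return None  # a single purchase has no gap to measure
    sv = sorted(dates)
    return mean([later - earlier for earlier, later in zip(sv, sv[1:])])


print(average_gap([41929]))                # single purchase
print(average_gap([100, 100]))             # two purchases on the same day
print(average_gap([42244, 41529, 41852]))  # three purchases
```

Because Excel serial dates count whole days, the result is a number of days.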

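Answer 2 (score: 0)

As you mentioned in your other post, this code should fix it. It takes all the dates for each name and associates them with that name as a sub-list. It then sorts the sub-list to put the dates in order, and finally writes the AVERAGE between the max and min dates. The averaging would be best done in its own function, but I kept it simple(r).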

import csv
index = []
input_file = 'input.csv'
output_file = 'output.csv'

def find_name(index, name):
    """ Binary search to see if the name exists in the index yet.
        Returns its position, or -1 if it is not there. """
    if len(index) == 0:
        return -1
    start = 0
    limit = len(index) - 1
    while start <= limit:
        guess = (start + limit) // 2  # integer division: guess is a list index
        if index[guess][0] == name:
            return guess
        elif index[guess][0] < name:
            start = guess + 1
        else:
            limit = guess - 1
    return -1

def add_to_index(index, name, date):
    """ Sorts the existing index, then looks the name up with "find_name".
        If the name is found, the date is appended to that name's list.
        If it's not found (-1), a new [name, [date]] entry is appended. """
    index.sort()
    name_index = find_name(index, name)
    if name_index == -1:
        index.append([name, [date]])
    else:
        index[name_index][1].append(date)


# Read through each row of the input file, skipping the header,
# and send each row to the "add_to_index" function.
with open( input_file, 'r', newline='' ) as infile:
    reader = csv.reader(infile, delimiter='\t')
    next(reader, None)  # skip the header
    for row in reader:
        add_to_index(index, row[0], row[1])

# Write the output from the index back to the output file, one row per
# name with the average number of days between that name's dates.
with open( output_file, 'w', newline='' ) as outfile:
    writer = csv.writer(outfile, delimiter='\t')
    for e in index:
        print(e)
        name = e[0]
        if len(e[1]) == 1:  # if only one date, answer is 0
            average_days = 0
        elif len(e[1]) == 2:  # if only two dates, answer is the diff
            e[1].sort(key=int)  # dates are still strings; sort them numerically
            average_days = int(e[1][-1]) - int(e[1][0])
        else:  # if more than two dates, average.
            e[1].sort(key=int)
            total = 0
            total_dates = len(e[1])
            print(total_dates)
            count = len(e[1]) - 1
            while count > 0:
                total += int(e[1][count]) - int(e[1][count - 1])
                print(total)
                count -= 1
            # Floor division keeps an integer result. Note the divisor is the
            # number of dates, not the number of gaps (total_dates - 1).
            average_days = total // total_dates
        writer.writerow([name, average_days])

I created a new input file so that one name has more than two dates. It looks like this:

Bill to Name    Date
James Doe       41929
Jane Doe        41852
Adam Adamson    42244
Adam Adamson    41529
Adam Adamson    41852

The output looks like this:

Adam Adamson    238
James Doe       0
Jane Doe        0
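For comparison, the same grouping can be sketched more compactly with a dictionary instead of a sorted list and binary search. This is only an illustration under stated assumptions: it targets Python 3, the function name `mean_gaps` is made up, and the sample data is read from an in-memory string rather than a file. Unlike the code above, it divides the total by the number of gaps rather than the number of dates, so Adam Adamson comes out as 357.5 instead of 238.

```python
import csv
import io
from collections import defaultdict

# Same rows as the sample input file, tab-separated
sample = (
    "Bill to Name\tDate\n"
    "James Doe\t41929\n"
    "Jane Doe\t41852\n"
    "Adam Adamson\t42244\n"
    "Adam Adamson\t41529\n"
    "Adam Adamson\t41852\n"
)


def mean_gaps(tsv_file):
    """Map each name to the mean gap between its sorted dates (0 if only one)."""
    dates_by_name = defaultdict(list)
    reader = csv.reader(tsv_file, delimiter="\t")
    next(reader, None)  # skip the header
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed rows
        dates_by_name[row[0]].append(int(row[1]))

    result = {}
    for name, dates in sorted(dates_by_name.items()):
        dates.sort()
        gaps = [later - earlier for earlier, later in zip(dates, dates[1:])]
        result[name] = sum(gaps) / len(gaps) if gaps else 0
    return result


results = mean_gaps(io.StringIO(sample))
for name, gap in results.items():
    print(name, gap)
```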