How do I calculate the average across different columns?

Time: 2017-02-16 21:24:50

Tags: python pandas numpy

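I want to calculate the average time for the sequences AD-VV-DD and AD-MM-PP. These sequences can appear in any of the MD_* columns. The TIME_* columns should be used to calculate the average time.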

df = 
MD_1   MD_2   MD_3    MD_4   MD_5  TIME_1  TIME_2  TIME_3  TIME_4  TIME_5
NaN    AD     VV      DD     NaN   NaN     3       2       1       NaN
AD     VV     DD      NaN    NaN   1       1       1       NaN     NaN
AD     MM     PP      NaN    NaN   4       3       3       NaN     NaN
TT     AD     MM      NaN    NaN   2       4       NaN     NaN     NaN    

The result should be:

result = 
MD_1_new   MD_2_new   MD_3_new   TIME_1_new TIME_2_new  TIME_3_new
AD         VV         DD         2          1.5         1
AD         MM         PP         4          3           3 

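The TIME_* columns of the first row are computed as follows: df contains two AD-VV-DD sequences, and for each one the TIME_X values are taken from the same positions X at which the sequence appears in MD_X.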
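To make the arithmetic concrete, here is a small sketch that reproduces the first result row by hand. The variable names are only illustrative, and the values are copied from the two rows of df that contain AD-VV-DD.

```python
import numpy as np

# TIME values aligned with each AD-VV-DD occurrence in df
row1_times = np.array([3, 2, 1])  # row 1: sequence in MD_2..MD_4, so TIME_2..TIME_4
row2_times = np.array([1, 1, 1])  # row 2: sequence in MD_1..MD_3, so TIME_1..TIME_3

# Average position by position across the two occurrences
avg_times = np.mean([row1_times, row2_times], axis=0)
print(avg_times.tolist())  # [2.0, 1.5, 1.0]
```

AD-MM-PP occurs only once in full (third row), so its averages are just that row's TIME values.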

This is what I tried, but how do I calculate the average for the corresponding MD_X?

TIME_*

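1 answer:

Answer 0: (score: 1)

Here is some code that does what you asked for. Its main organizing principle is to build a tuple for each key we need to look for, and then use those keys to build a dict. For each row in the dataframe, it checks whether a key is present at any of the three possible starting positions. The check is done by looking the candidate tuple up in the dict. If it is present, the aligned data values are stored so they can be averaged later.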
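As a small illustration of the lookup idea on a single row, the sketch below slides a window of three labels across the five MD_* positions and tests each window against a set of target tuples. The names `md`, `times`, `targets`, and `found` are made up for this example, and the row is written out by hand rather than read from the dataframe.

```python
# One row of MD_* labels and its TIME_* values (first row of df)
md = ('NaN', 'AD', 'VV', 'DD', 'NaN')
times = (None, 3, 2, 1, None)

# The sequences we are looking for, stored as hashable tuples
targets = {('AD', 'VV', 'DD'), ('AD', 'MM', 'PP')}

# Slide a window of width 3 over the five MD_* positions
found = None
for j in range(len(md) - 2):
    window = md[j:j + 3]
    if window in targets:
        # keep the TIME_* values at the same positions as the match
        found = (window, times[j:j + 3])
        break

print(found)  # (('AD', 'VV', 'DD'), (3, 2, 1))
```

Because tuples are hashable, each membership test is a single hash lookup, which is why the keys are built as tuples rather than lists.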

Code:

# build a dict with tuple keys for the results
matches = {
    ('AD', 'VV', 'DD'): [],
    ('AD', 'MM', 'PP'): [],
}

# for each row check for key matches
for i, row in df.iterrows():
    keys = tuple(row.values[0:5])
    for j in range(3):
        try:
            # check if these three columns match one of our tuple keys
            # if it matches append the three columns of data
            matches[tuple(keys[j:j+3])].append(
                row.values[5+j:8+j].astype(int))
            break
        except KeyError:
            pass

# average the data
avg = {}
for k, v in matches.items():
    avg[k] = sum(v) / float(len(v))
print(avg)

Test Data:

import pandas as pd

data = [x.strip().split() for x in """
    MD_1   MD_2   MD_3    MD_4   MD_5  TIME_1  TIME_2  TIME_3  TIME_4  TIME_5
    NaN    AD     VV      DD     NaN   NaN     3       2       1       NaN
    AD     VV     DD      NaN    NaN   1       1       1       NaN     NaN
    AD     MM     PP      NaN    NaN   4       3       3       NaN     NaN
    TT     AD     MM      NaN    NaN   2       4       NaN     NaN     NaN
""".split('\n')[1:-1]]
df = pd.DataFrame(data[1:], columns=data[0])

Output:

{('AD', 'VV', 'DD'): array([ 2. ,  1.5,  1. ]), ('AD', 'MM', 'PP'): array([ 4.,  3.,  3.])}
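If the result is needed as a dataframe in the layout shown in the question, the dict of averages can be reshaped afterwards. This is only a sketch: the `avg` values are copied from the output above so the snippet runs on its own, and the column names are taken from the question's expected result.

```python
import numpy as np
import pandas as pd

# Averages as printed in the output above, keyed by sequence
avg = {
    ('AD', 'VV', 'DD'): np.array([2.0, 1.5, 1.0]),
    ('AD', 'MM', 'PP'): np.array([4.0, 3.0, 3.0]),
}

# Column names follow the layout requested in the question
md_cols = ['MD_%d_new' % i for i in range(1, 4)]
time_cols = ['TIME_%d_new' % i for i in range(1, 4)]

# One row per sequence: the three labels followed by the three averages
rows = [list(key) + values.tolist() for key, values in avg.items()]
result = pd.DataFrame(rows, columns=md_cols + time_cols)
print(result)
```

The row order follows the insertion order of the dict, which is preserved in Python 3.7 and later.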