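I want to compute the average time for the sequences AD-VV-DD and AD-MM-PP. A sequence can appear in any of the MD_* columns. The TIME_* columns should be used to compute the average time: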
df =
MD_1 MD_2 MD_3 MD_4 MD_5 TIME_1 TIME_2 TIME_3 TIME_4 TIME_5
NaN AD VV DD NaN NaN 3 2 1 NaN
AD VV DD NaN NaN 1 1 1 NaN NaN
AD MM PP NaN NaN 4 3 3 NaN NaN
TT AD MM NaN NaN 2 4 NaN NaN NaN
The result should look like this:
result =
MD_1_new MD_2_new MD_3_new TIME_1_new TIME_2_new TIME_3_new
AD VV DD 2 1.5 1
AD MM PP 4 3 3
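The TIME_* columns of the first result row are computed as follows: df contains two occurrences of the sequence AD-VV-DD. For each occurrence, the TIME_X values are taken from the same positions X as the MD_X columns that hold the sequence, so the averages are (3 + 1) / 2 = 2, (2 + 1) / 2 = 1.5 and (1 + 1) / 2 = 1.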
This is what I tried, but how can I compute the average of the TIME_* values for the corresponding MD_X columns?
Answer 0 (score: 1)
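Here is some code that does what you asked for. Its main organizing principle is to build a tuple for each key we need to find, and then to build a dict keyed by those tuples. For each row of the dataframe, it checks whether a key is present at any of the three possible positions, by testing whether the three-column window exists as a key in the dict. If it does, the aligned data values are stored so they can be averaged later.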
Code:
# build a dict with tuple keys for the results
matches = {
    ('AD', 'VV', 'DD'): [],
    ('AD', 'MM', 'PP'): [],
}

# for each row check for key matches
for i, row in df.iterrows():
    keys = tuple(row.values[0:5])
    for j in range(3):
        try:
            # check if these three columns match one of our tuple keys
            # if it matches append the three columns of data
            matches[tuple(keys[j:j+3])].append(
                row.values[5+j:8+j].astype(int))
            break
        except KeyError:
            pass

# average the data
avg = {}
for k, v in matches.items():
    avg[k] = sum(v) / float(len(v))

print(avg)
Test data:
import pandas as pd

data = [x.strip().split() for x in """
MD_1 MD_2 MD_3 MD_4 MD_5 TIME_1 TIME_2 TIME_3 TIME_4 TIME_5
NaN AD VV DD NaN NaN 3 2 1 NaN
AD VV DD NaN NaN 1 1 1 NaN NaN
AD MM PP NaN NaN 4 3 3 NaN NaN
TT AD MM NaN NaN 2 4 NaN NaN NaN
""".split('\n')[1:-1]]
df = pd.DataFrame(data[1:], columns=data[0])
Output:
{('AD', 'VV', 'DD'): array([ 2. , 1.5, 1. ]), ('AD', 'MM', 'PP'): array([ 4., 3., 3.])}
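If the averages are needed in the tabular layout shown in the question, the same sliding-window idea can be wrapped in a function that returns a DataFrame. The following is only a minimal sketch under a few assumptions: `average_sequence_times` is a hypothetical helper name, the MD_* and TIME_* columns are assumed to appear in matching order, all sequences are assumed to have the same length, and the sample frame is rebuilt here with real NaN values rather than the strings used in the test data above.

```python
import numpy as np
import pandas as pd

# Sample data from the question, rebuilt with real NaN values (an assumption;
# the test data above stores every cell as a string).
df = pd.DataFrame(
    [
        [np.nan, "AD", "VV", "DD", np.nan, np.nan, 3, 2, 1, np.nan],
        ["AD", "VV", "DD", np.nan, np.nan, 1, 1, 1, np.nan, np.nan],
        ["AD", "MM", "PP", np.nan, np.nan, 4, 3, 3, np.nan, np.nan],
        ["TT", "AD", "MM", np.nan, np.nan, 2, 4, np.nan, np.nan, np.nan],
    ],
    columns=[f"MD_{i}" for i in range(1, 6)]
    + [f"TIME_{i}" for i in range(1, 6)],
)


def average_sequence_times(frame, sequences):
    """Hypothetical helper: average the TIME_* values aligned with each sequence."""
    md = frame.filter(regex=r"^MD_\d+$").to_numpy(dtype=object)
    time = frame.filter(regex=r"^TIME_\d+$").to_numpy(dtype=float)
    width = len(sequences[0])  # assumes every sequence has the same length

    rows = []
    for seq in sequences:
        hits = []
        for r in range(md.shape[0]):
            # slide a window of `width` columns across the MD_* values
            for start in range(md.shape[1] - width + 1):
                if tuple(md[r, start:start + width]) == tuple(seq):
                    hits.append(time[r, start:start + width])
                    break  # at most one match per row, as in the loop above
        if hits:  # skip sequences that never occur instead of dividing by zero
            rows.append(list(seq) + list(np.mean(hits, axis=0)))

    columns = [f"MD_{i}_new" for i in range(1, width + 1)] + [
        f"TIME_{i}_new" for i in range(1, width + 1)
    ]
    return pd.DataFrame(rows, columns=columns)


result = average_sequence_times(df, [("AD", "VV", "DD"), ("AD", "MM", "PP")])
print(result)
```

Unlike the dict-based loop, this version silently leaves out any sequence that has no match, which avoids a division by zero but may or may not be the behavior you want.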