Question

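I have a set of measurements; each measurement is a row in a DataFrame. I want to add a column to these measurements that reflects the speedup of that measurement relative to a reference. Each measurement is identified by its "dataset" and "algorithm", and each dataset-algorithm pair has one reference runtime.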

import pandas as pd

col = ['program', 'dataset', 'algorithm', 'extra', 'runtime']
df = pd.DataFrame(
    [['program-ref', 'dataset-X', 'algorithm-i', 'x', 1.0],
     ['program-ref', 'dataset-X', 'algorithm-j', 'x', 2.0],
     ['program-ref', 'dataset-Y', 'algorithm-i', 'x', 3.0],
     ['program-ref', 'dataset-Y', 'algorithm-j', 'x', 4.0],
     ['program-B', 'dataset-X', 'algorithm-i', 'x', 5.0],
     ['program-B', 'dataset-X', 'algorithm-j', 'x', 6.0],
     ['program-B', 'dataset-Y', 'algorithm-i', 'x', 7.0],
     ['program-B', 'dataset-Y', 'algorithm-j', 'x', 8.0],
     ['program-C', 'dataset-X', 'algorithm-i', 'x', 9.0],
     ['program-D', 'dataset-X', 'algorithm-j', 'x', 10.0],
     ['program-E', 'dataset-Y', 'algorithm-i', 'x', 11.0],
     ['program-E', 'dataset-Y', 'algorithm-j', 'x', 12.0],
    ], columns=col)

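I want to add a column named 'speedup', where the 'speedup' of each measurement is computed as the reciprocal of the measurement's runtime divided by the runtime of the reference measurement (for that dataset-algorithm pair). For example, in the DataFrame above, the 'speedup' for the fifth row (program-B, dataset-X, algorithm-i) should be 1/(5.0/1.0).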

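This looks like an instance of split-apply-combine (http://pandas.pydata.org/pandas-docs/stable/groupby.html), but the apply functions shown there are usually either aggregations over everything in the group, or functions whose input is just one particular measurement. Here, I need to "apply" the reference measurement to everything in its group.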

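I also added the 'extra' column above because I want the output to be identical to the input except for the new 'speedup' column, whereas groupby seems to want to drop all the "nuisance" columns.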

Answer 1

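I'm not keen on how the data is set up for this goal, since each algorithm-dataset combination has multiple program names. Also note that, because the program-dataset-algorithm combinations are unique, a groupby approach is unnecessary for the sample data. Perhaps your real data has different requirements? If so, please update the sample data to reflect them. In the meantime, try the following.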

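It is easier to merge the reference values with the rest of the data, so that each measurement sits in the same row as its corresponding reference value.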

ref_df = df.loc[df['program'] == 'program-ref', ['dataset', 'algorithm', 'runtime']]
# EDIT: only include the following line if you wish to remove the reference
# rows from the final output
# df = df.loc[~(df['program'] == 'program-ref')]

new_df = pd.merge(df, ref_df, on=['dataset', 'algorithm'],
                  suffixes=('', '_ref'))

# you don't actually need a groupby since there are unique
# program-dataset-algorithm combinations.
new_df['speedup'] = 1/(new_df['runtime']/new_df['runtime_ref'])

# optional groupby approach; group_keys=False keeps the original row index,
# so the result aligns with new_df on assignment
new_df['speedup'] = new_df.groupby(['program', 'dataset', 'algorithm'],
                                   group_keys=False).apply(
                           lambda x: 1/(x['runtime']/x['runtime_ref']))

>>> new_df.sort_values('program', ascending=False)
        program    dataset    algorithm extra  runtime  runtime_ref   speedup
0   program-ref  dataset-X  algorithm-i     x      1.0          1.0  1.000000
3   program-ref  dataset-X  algorithm-j     x      2.0          2.0  1.000000
6   program-ref  dataset-Y  algorithm-i     x      3.0          3.0  1.000000
9   program-ref  dataset-Y  algorithm-j     x      4.0          4.0  1.000000
8     program-E  dataset-Y  algorithm-i     x     11.0          3.0  0.272727
11    program-E  dataset-Y  algorithm-j     x     12.0          4.0  0.333333
5     program-D  dataset-X  algorithm-j     x     10.0          2.0  0.200000
2     program-C  dataset-X  algorithm-i     x      9.0          1.0  0.111111
1     program-B  dataset-X  algorithm-i     x      5.0          1.0  0.200000
4     program-B  dataset-X  algorithm-j     x      6.0          2.0  0.333333
7     program-B  dataset-Y  algorithm-i     x      7.0          3.0  0.428571
10    program-B  dataset-Y  algorithm-j     x      8.0          4.0  0.500000
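
The question asked for output identical to the input apart from the new 'speedup' column, whereas the result above still carries the helper column runtime_ref. The following is a minimal sketch of one way to adapt the merge approach, not a definitive implementation. It assumes a shortened version of the sample data, and the names keys, merged and result are illustrative only. It uses how='left' so the rows of df keep their original order, and validate='many_to_one' so that pandas raises an error if any dataset-algorithm pair has more than one reference row.

```python
import pandas as pd

# Shortened sample data (assumption: same layout as the question's DataFrame)
col = ['program', 'dataset', 'algorithm', 'extra', 'runtime']
df = pd.DataFrame(
    [['program-ref', 'dataset-X', 'algorithm-i', 'x', 1.0],
     ['program-ref', 'dataset-X', 'algorithm-j', 'x', 2.0],
     ['program-B', 'dataset-X', 'algorithm-i', 'x', 5.0],
     ['program-B', 'dataset-X', 'algorithm-j', 'x', 6.0],
    ], columns=col)

keys = ['dataset', 'algorithm']  # illustrative name for the grouping columns
ref_df = df.loc[df['program'] == 'program-ref', keys + ['runtime']]

# how='left' keeps every row of df in its original order;
# validate='many_to_one' raises pandas.errors.MergeError if a
# dataset-algorithm pair has more than one reference row.
merged = df.merge(ref_df, on=keys, how='left', suffixes=('', '_ref'),
                  validate='many_to_one')

# runtime_ref / runtime is the same quantity as 1 / (runtime / runtime_ref)
merged['speedup'] = merged['runtime_ref'] / merged['runtime']

# Drop the helper column so only 'speedup' is added to the original columns
result = merged.drop(columns='runtime_ref')
print(result)
```

With a left merge, a measurement whose dataset-algorithm pair has no reference row would get NaN in 'speedup' rather than being dropped, which may or may not be what you want for real data.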

pandas: normalize values within groups, with one reference value per group (groupby? split-apply-combine?)
