Question

I have been wondering how to solve the following problem. Suppose I have a DataFrame df that looks like this:

Name     quantity     price
A        1            10.0
A        3            26.0
B        1            15.0
B        3            30.0
...

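Now suppose I want to extrapolate the price by quantity so that every Name has rows for quantity = 1, 2, 3, where only some of those quantities and their prices are available. That is, say I have a function extrapolate(qts, prices, n) that computes the price for quantity = n from the known quantities qts and their prices. The result would look like this: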

Name     quantity     price
A        1            10.0
A        2            extrapolate([1, 3], [10.0, 26.0], 2)
A        3            26.0
B        1            15.0
B        2            extrapolate([1, 3], [15.0, 30.0], 2)
B        3            30.0
...

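I would appreciate some insight into how to achieve this, or a pointer to where I can learn more about how groupby could be used in this case.

Thanks in advance.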

Answer 1

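What you want is called missing data imputation. There are many ways to do it.

You may want to check out a package called fancyimpute. It provides data imputation using MICE, which seems to fit your needs.

Beyond that, if your case is structurally as simple as the example, you can always use groupby('Name').mean(), which gives you the middle value for each subgroup.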

Answer 2

The following should do what you describe:

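import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Name': ['A', 'A', 'B', 'B'],
    'quantity': [1, 3, 1, 3],
    'price': [10.0, 26.0, 15.0, 30.0],
})

def get_extrapolate_val(group, qts, prices, n):
    # qts and prices are column names; n is the target quantity.
    # Do your actual calculation here; for now it just returns a dummy value.
    some_value = (group[qts] * group[prices]).sum() / n
    return some_value

# Some definitions
n = 2
quan_col = 'quantity'
price_col = 'price'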

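First, we group by Name and apply get_extrapolate_val to each group, passing the two column names and n as additional arguments. Since this returns a Series, we also need reset_index and rename, which makes the concatenation easier.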

new_stuff = df.groupby('Name').apply(get_extrapolate_val, quan_col, price_col, n).reset_index().rename(columns={0: price_col})

Add n as an additional column:

new_stuff[quan_col] = n

Then we concatenate the two DataFrames and we are done:

final_df = pd.concat([df, new_stuff]).sort_values(['Name', quan_col]).reset_index(drop=True)

  Name  price  quantity
0    A   10.0         1
1    A   44.0         2
2    A   26.0         3
3    B   15.0         1
4    B   52.5         2
5    B   30.0         3

The values I added here are of course meaningless; they only serve to illustrate the approach.
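As a rough sketch of what a real calculation could look like in place of the dummy value, the example below swaps in linear interpolation via numpy.interp. This is an assumption about what extrapolate is meant to do, not something the question specifies, and interpolate_price is a hypothetical helper name. Note that numpy.interp only interpolates between known points; outside the known range it returns the nearest endpoint value rather than extrapolating.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Name": ["A", "A", "B", "B"],
    "quantity": [1, 3, 1, 3],
    "price": [10.0, 26.0, 15.0, 30.0],
})

def interpolate_price(group, qty_col, price_col, n):
    # Hypothetical helper: linear interpolation between the known points.
    # np.interp needs the x values in increasing order, so sort first.
    ordered = group.sort_values(qty_col)
    return float(np.interp(n, ordered[qty_col], ordered[price_col]))

n = 2
# One new row per Name, holding the price computed for quantity n
rows = [
    {"Name": name, "quantity": n,
     "price": interpolate_price(group, "quantity", "price", n)}
    for name, group in df.groupby("Name")
]

final_df = (
    pd.concat([df, pd.DataFrame(rows)])
    .sort_values(["Name", "quantity"])
    .reset_index(drop=True)
)
print(final_df)
```

For the sample data this yields 18.0 for A and 22.5 for B at quantity 2, the same values the plain mean gives, because 2 lies exactly halfway between 1 and 3.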

Old version

Assuming the quantity column always contains only 1 and 3, the following should work:

new_stuff = df.groupby('Name', as_index=False)['price'].mean()

This gives

  Name  price
0    A   18.0
1    B   22.5

This, as written, assumes that there are always only the quantities 1 and 3, so we can simply calculate the mean.

Then we add the quantity 2

new_stuff['quantity'] = 2

and concatenate the two DataFrames, followed by a sort

pd.concat([df, new_stuff]).sort_values(['Name', 'quantity']).reset_index(drop=True)

which gives the expected result

  Name  price  quantity
0    A   10.0         1
1    A   18.0         2
2    A   26.0         3
3    B   15.0         1
4    B   22.5         2
5    B   30.0         3

There may well be more elegant ways to do this, though...
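One possible alternative, sketched here under the assumption that linear interpolation on the quantity axis is acceptable, is to reindex each group onto the full list of target quantities and let pandas fill the gaps. The name target_quantities is made up for this illustration.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Name": ["A", "A", "B", "B"],
    "quantity": [1, 3, 1, 3],
    "price": [10.0, 26.0, 15.0, 30.0],
})

# Assumed list of quantities every Name should end up with
target_quantities = [1, 2, 3]

pieces = []
for name, group in df.groupby("Name"):
    prices = (
        group.set_index("quantity")["price"]
        # Missing quantities become NaN rows...
        .reindex(pd.Index(target_quantities, name="quantity"))
        # ...which are filled linearly, using the quantity as the x axis
        .interpolate(method="index")
    )
    pieces.append(prices.reset_index().assign(Name=name))

filled = pd.concat(pieces, ignore_index=True)[["Name", "quantity", "price"]]
print(filled)
```

Unlike the mean-based version, this does not rely on the known quantities being exactly 1 and 3, although quantities outside the known range of a group are not linearly extrapolated by this method.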
