Question

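I am working with a dataset that contains 'date', 'id' and 'antiquity' columns. For the same product, the antiquity is always the same even though the date changes: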

date        id          antiquity
01/06/2015  21972.00    5241.00
02/06/2015  21972.00    5241.00
03/06/2015  21972.00    5241.00
04/06/2015  21972.00    5241.00
05/06/2015  21972.00    5241.00

Or:

date        id          antiquity
01/06/2015  28794.00    4157.00
02/06/2015  28794.00    4157.00
03/06/2015  28794.00    4157.00
04/06/2015  28794.00    4157.00
05/06/2015  28794.00    4157.00

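This is an error in the dataset. For each id, I need to add an increasing offset to the 'antiquity' column, starting from the oldest date: the first row adds 0 to that id's 'antiquity' value, the second row adds 1, the third row adds 2, and so on.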

I wrote a function that does this:

def add_antiquity(dataframe):
    antiquity_id = dataframe.antiquity.values
    return pd.Series([int(antiquity_id[i])+i for i in range(0,len(antiquity_id))], index=dataframe.index)

I call the function for a single id (just to test it):

new_serie = add_antiquity(df[df['id'] == 21972.0])
df[df.index.isin(new_serie.index)]['antiquity'] = new_serie

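When I run it, I get a SettingWithCopyWarning and it does not work: the dataframe values are not updated. I would like to loop over each id and call this function.

How can I do this? Is there a way to use a Pandas function such as apply()?

Thanks!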

Answer 1

Would it work to add the row number within each id group, obtained with cumcount, to the antiquity column?

df['antiquity'] += df.groupby('id').cumcount()
df

Output:

         date       id  antiquity
0  01/06/2015  21972.0     5241.0
1  02/06/2015  21972.0     5242.0
2  03/06/2015  21972.0     5243.0
3  04/06/2015  21972.0     5244.0
4  05/06/2015  21972.0     5245.0
5  01/06/2015  28794.0     4157.0
6  02/06/2015  28794.0     4158.0
7  03/06/2015  28794.0     4159.0
8  04/06/2015  28794.0     4160.0
9  05/06/2015  28794.0     4161.0
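
For anyone who wants to try this without the original file, here is a minimal self-contained sketch. The sample frame is a shortened, hand-built copy of the data in the question (three rows per id instead of five), so treat it as an illustration rather than the asker's real dataset.

```python
import pandas as pd

# Hand-built sample mirroring the question's data (shortened to 3 rows per id).
df = pd.DataFrame({
    "date": ["01/06/2015", "02/06/2015", "03/06/2015"] * 2,
    "id": [21972.0] * 3 + [28794.0] * 3,
    "antiquity": [5241.0] * 3 + [4157.0] * 3,
})

# cumcount() numbers the rows inside each id group as 0, 1, 2, ...
# Adding that counter gives the increasing antiquity per id.
df["antiquity"] += df.groupby("id").cumcount()
print(df)
```

The counter restarts at 0 for every id, so the first row of each group keeps its original value.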

P.S. Of course, the dataset must be sorted by date for this to work. If it is not, start with:

df = df.sort_values('date')
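
One caveat, offered as a sketch: if the date column holds strings rather than datetimes, sort_values compares them as text, which can misorder rows across months. The example below assumes the strings are in day/month/year format (the question does not say which), and the sample rows are made up to show the effect.

```python
import pandas as pd

# Made-up rows, deliberately out of order and spanning two months.
df = pd.DataFrame({
    "date": ["02/06/2015", "01/07/2015", "01/06/2015"],
    "id": [21972.0, 21972.0, 21972.0],
    "antiquity": [5241.0, 5241.0, 5241.0],
})

# Assumption: dates are day/month/year strings. Parse them so that sorting
# is chronological instead of alphabetical.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")

# Sort by id and date, then add the per-id row counter.
df = df.sort_values(["id", "date"])
df["antiquity"] += df.groupby("id").cumcount()
print(df)
```

Sorting by both id and date also keeps each product's rows together, which makes the result easier to read.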

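P.P.S. If for some reason you want to do it with a function (it is slower, so generally not recommended), the problem with your code is that you are setting the new values on a copy of the dataframe (df[...][...] returns a copy). The fix is to use loc: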

df.loc[df.index.isin(new_serie.index), 'antiquity'] = new_serie
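
To cover the "loop over each id" part of the question, here is one possible sketch that reuses the asker's add_antiquity function and writes back through loc. The sample frame and the names product_id and mask are illustrative assumptions, and this is expected to be slower than the cumcount approach on large data.

```python
import pandas as pd

def add_antiquity(dataframe):
    # The asker's function: add the row position (0, 1, 2, ...) to antiquity.
    antiquity_id = dataframe.antiquity.values
    return pd.Series(
        [int(antiquity_id[i]) + i for i in range(len(antiquity_id))],
        index=dataframe.index,
    )

# Hand-built sample mirroring the question's data (shortened).
df = pd.DataFrame({
    "date": ["01/06/2015", "02/06/2015", "03/06/2015"] * 2,
    "id": [21972.0] * 3 + [28794.0] * 3,
    "antiquity": [5241.0] * 3 + [4157.0] * 3,
})

for product_id in df["id"].unique():
    mask = df["id"] == product_id
    # A single .loc call selects rows and column at once, so the assignment
    # reaches the original frame instead of a temporary copy.
    df.loc[mask, "antiquity"] = add_antiquity(df[mask])

print(df)
```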
