Question

我有这个熊猫数据框：

df = pd.DataFrame(
    {
    "col1": [1,1,2,3,3,3,4,5,5,5,5]
    }
)
df

如果col1中的值不等于下一行中col1的值，我想添加另一列“ last”。它应该是这样的：

到目前为止，如果col1中的值不等于下一行中col1的值，那么我可以创建一个包含True的列。否则为False

df["last_row"] = df["col1"].shift(-1)
df['last'] = df["col1"] != df["last_row"]
df = df.drop(["last_row"], axis=1)
df

现在类似

df["last_row"] = df["col1"].shift(-1)
df['last'] = "last" if df["col1"] != df["last_row"]
df = df.drop(["last_row"], axis=1)
df

会很好，但这显然是错误的语法。我该如何做到这一点？

最后，我还想添加数字，以指示一个值在此之前出现多少次，而最后一个值始终标记为“ last”。它应该看起来像这样：

我不确定这是我开发的又一步还是是否需要新方法。我读到如果要在修改值时循环遍历数组，则应使用apply（）。但是，我不知道如何在其中包括条件。你能帮我吗？

非常感谢！

Answer 1

这是一种方式。您可以基于col1中的下一个值是否与当前行的下一个相同，定义自定义分组程序并采用DataFrameGroupBy.cumsum来获得累积计数。然后使用last使用类似的条件添加df.shift：

g = df.col1.ne(df.col1.shift(1)).cumsum()
df['update'] = df.groupby(g).cumcount()
ix = df[df.col1.ne(df.col1.shift(-1))].index
# Int64Index([1, 2, 5, 6, 10], dtype='int64')
df.loc[ix,'update'] = 'last'

 col1 update
0      1      0
1      1   last
2      2   last
3      3      0
4      3      1
5      3   last
6      4   last
7      5      0
8      5      1
9      5      2
10     5   last

Answer 2

考虑到索引是递增索引，请（1）每组cuncount，然后在每组内获取（2）max索引并设置字符串

group = df.groupby('col1')

df['last'] = group.cumcount()
df.loc[group['last'].idxmax(), 'last'] = 'last'
#or df.loc[group.apply(lambda x: x.index.max()), 'last'] = 'last'


    col1    last
0   1   0
1   1   last
2   2   last
3   3   0
4   3   1
5   3   last
6   4   last
7   5   0
8   5   1
9   5   2
10  5   last

Answer 3

使用.shift查找发生变化的地方。然后，您可以使用.where进行适当的遮罩，然后使用.fillna

s = df.col1 != df.col1.shift(-1)
df['Update'] = df.groupby(s.cumsum().where(~s)).cumcount().where(~s).fillna('last')

输出：

    col1 Update
0      1      0
1      1   last
2      2   last
3      3      0
4      3      1
5      3   last
6      4   last
7      5      0
8      5      1
9      5      2
10     5   last

顺便说一句，update是DataFrames的一种方法，因此应避免命名列'update'

Answer 4

另一种可能的解决方案。

df['update'] = np.where(df['col1'].ne(df['col1'].shift(-1)), 'last', 0)

如何在条件下遍历熊猫数据框并修改值？

4 个答案:

输出：