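I'm trying to create a counter whose value changes only when the current row differs from the previous row, or when the ID I'm grouping by changes.
Suppose I have the following DataFrame: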
ID Flag New_Column
A NaN 1
A 0 1
A 0 1
A 0 1
A 1 2
A 1 2
A 1 2
A 0 3
A 0 3
A 0 3
A 1 4
A 1 4
A 1 4
B NaN 1
B 0 1
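I want to create New_Column: each time the Flag value changes, New_Column increases by 1, and when the ID changes it resets to 1 and starts over.
Here is what I tried with np.select, but it doesn't work: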
df['New_Column'] = None
df['Flag_Lag'] = df.sort_values(by=['ID', 'Date_Time'], ascending=True).groupby(['ID'])['Flag'].shift(1)
df['ID_Lag'] = df.sort_values(by=['ID', 'Date_Time'], ascending=True).groupby(['ID'])['ID'].shift(1)
conditions = [((df['Flag'] != df['Flag_Lag']) & (df['ID'] == df['ID_Lag'])),
((df['Flag'] == df['Flag_Lag']) & (df['ID'] == df['ID_Lag'])),
((df['Flag_Lag'] == np.nan) & (df['New_Column'].shift(1) == 1)),
((df['ID'] != df['ID_Lag']))
]
choices = [(df['New_Column'].shift(1) + 1),
(df['New_Column'].shift(1)),
(df['New_Column'].shift(1)),
1]
df['New_Column'] = np.select(conditions, choices, default=np.nan)
With this code, the first value of New_Column is 1, the second is NaN, and the rest are None.
Does anyone know a better way to do this?
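Two details in the attempt above are worth checking in isolation. The comparison `df['Flag_Lag'] == np.nan` is always False, because NaN never compares equal to anything, including itself. Also, every entry in `choices` is computed once from `df['New_Column'].shift(1)` while New_Column is still all None, so np.select cannot carry a running count from one row to the next. The small sketch below only illustrates the NaN comparison; the Series `s` is made-up example data, not part of the original DataFrame.

```python
import numpy as np
import pandas as pd

# Made-up example values: one missing entry followed by two flags.
s = pd.Series([np.nan, 0.0, 1.0])

# Equality against NaN never matches, even for the missing entry.
eq_nan = (s == np.nan).tolist()
print(eq_nan)   # [False, False, False]

# isna() is the reliable way to detect missing values.
is_missing = s.isna().tolist()
print(is_missing)   # [True, False, False]
```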
Answer 0 (score: 1)
Group by ID and take the cumulative sum of the condition (current value not equal to the previous value).
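# Within each ID: mark rows where Flag differs from the previous row, then cumulative-sum the marks
df['new'] = (
    df.groupby('ID')
      .apply(lambda x: x['Flag'].fillna(0).diff().ne(0).cumsum())
      .reset_index(level=0, drop=True)
)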
ID Flag New_Column new
0 A NaN 1 1
1 A 0.0 1 1
2 A 0.0 1 1
3 A 0.0 1 1
4 A 1.0 2 2
5 A 1.0 2 2
6 A 1.0 2 2
7 A 0.0 3 3
8 A 0.0 3 3
9 A 0.0 3 3
10 A 1.0 4 4
11 A 1.0 4 4
12 A 1.0 4 4
13 B NaN 1 1
14 B 0.0 1 1
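For anyone who wants to reproduce this, here is a self-contained sketch of the same idea written without `apply`. It assumes the sample data from the question, with rows already in the desired order (the Date_Time column is left out). The variable names `flag` and `changed` are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question (already sorted; Date_Time omitted).
df = pd.DataFrame({
    "ID": ["A"] * 13 + ["B"] * 2,
    "Flag": [np.nan, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, np.nan, 0],
})

# Treat a missing Flag as 0, as the answer above does.
flag = df["Flag"].fillna(0)

# True where Flag differs from the previous row of the same ID.
# The first row of each ID compares against NaN, so it is always True.
changed = flag.ne(flag.groupby(df["ID"]).shift())

# A running sum of the change marks within each ID gives the counter.
df["new"] = changed.astype(int).groupby(df["ID"]).cumsum()

print(df)
```

Because the first row of every ID counts as a change, each group starts at 1 with no special reset logic.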
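Answer 1 (score: 1)
If speed is not a concern and you want code that is easy to read, you can simply iterate over the DataFrame and run a simple function on each row.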
def f(row):
    global previous_ID, previous_flag, previous_count
    if previous_ID is None:  # first row: start the count
        count = 1
    elif previous_ID != row['ID']:  # new ID: start the count over
        count = 1
    elif previous_flag == row['Flag']:  # same ID, same Flag
        count = previous_count
    else:  # same ID, different Flag
        count = previous_count + 1
    previous_ID = row['ID']
    previous_flag = row['Flag']
    previous_count = count
    return count  # rows from iterrows() are copies, so return the value instead of assigning to the row
You should fill the NaN values with 0, or add a special case to the function.
You can run the function like this:
previous_ID, previous_flag, previous_count = None, None, None
df['Flag'] = df['Flag'].fillna(0)  # treat a missing Flag as 0
counts = []
for i, row in df.iterrows():
    counts.append(f(row))
df['New_Column'] = counts
That's it.
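If the global variables feel awkward, the same loop can keep its state in local variables. This is only a sketch of that variant, not a different algorithm: the helper name `run_counter` is hypothetical, and it assumes the rows are already sorted and that a missing Flag should be treated as 0.

```python
import numpy as np
import pandas as pd

def run_counter(ids, flags):
    """Return one counter per row: +1 when the flag changes, reset to 1 on a new ID."""
    counts = []
    prev_id = prev_flag = None
    count = 0
    for cur_id, cur_flag in zip(ids, flags):
        if not counts or cur_id != prev_id:
            count = 1            # first row overall, or a new ID: restart
        elif cur_flag != prev_flag:
            count += 1           # same ID, flag changed
        counts.append(count)
        prev_id, prev_flag = cur_id, cur_flag
    return counts

# Sample data from the question (already sorted; Date_Time omitted).
df = pd.DataFrame({
    "ID": ["A"] * 13 + ["B"] * 2,
    "Flag": [np.nan, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, np.nan, 0],
})

# Fill NaN with 0 first so that NaN -> 0 is not counted as a change.
df["New_Column"] = run_counter(df["ID"], df["Flag"].fillna(0))
print(df)
```

Iterating over the two columns with `zip` also avoids building a Series for every row, which `iterrows()` does.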