I have a dataframe named df that looks similar to this (the number of "Date" columns goes up to Date_8 and there are hundreds of clients; I have simplified it here).
Client_ID  Date_1        Date_2        Date_3        Date_4
C1019876   relationship  no change     no change     no change
C1018765   no change     single        no change     no change
C1017654   single        no change     relationship  NaN
C1016543   NaN           relationship  no change     single
C1015432   NaN           no change     single        NaN
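I want to create two new columns, first_status and last_status. first_status should equal the first relationship status given across the 4 date columns, i.e. the first response that is relationship or single, while last_status should equal the last relationship status given across the 4 date columns. The resulting df should look like this.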
Client_ID  Date_1        Date_2        Date_3        Date_4     first_status  last_status
C1019876   relationship  no change     no change     no change  relationship  relationship
C1018765   no change     single        no change     no change  single        single
C1017654   single        no change     relationship  NaN        single        relationship
C1016543   NaN           relationship  no change     single     relationship  single
C1015432   NaN           no change     single        NaN        single        single
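I think these two columns could be created with list comprehensions, but I don't know how. For the first_status column, I imagine the code would do something like the following for each row of df:
Start at the first Date column in which a value is given (filtering out NaN).
If the value is no change, go to the next Date column.
If the value is relationship, then first_status = relationship.
If the value is single, then first_status = single.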
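For the last_status column, I imagine the code would do something like the following for each row of df:
Start at the last Date column in which a value is given (filtering out NaN).
If the value is no change, go to the previous Date column.
If the value is relationship, then last_status = relationship.
If the value is single, then last_status = single.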
Answer 0 (score: 3)
You can replace no change with np.nan and then use bfill and ffill to select the first and last valid values, respectively:
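import numpy as np
df = df.replace('no change', np.nan)  # treat 'no change' as missing
df['first_status'] = df.bfill(axis=1).Date_1  # back-fill along each row, so Date_1 holds the first valid status
df['last_status'] = df.loc[:,:'Date_4'].ffill(axis=1).Date_4  # forward-fill columns up to Date_4, so Date_4 holds the last valid status
#df = df.fillna('no_change') # if needed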
Client_ID Date_1 Date_2 Date_3 Date_4 first_status \
0 C1019876 relationship NaN NaN NaN relationship
1 C1018765 NaN single NaN NaN single
2 C1017654 single NaN relationship NaN single
3 C1016543 NaN relationship NaN single relationship
4 C1015432 NaN NaN single NaN single
last_status
0 relationship
1 single
2 relationship
3 single
4 single
If the Date columns run up to Date_n, use df.loc[:,:'Date_n'].ffill(axis=1).Date_n for last_status.
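For an arbitrary number of Date columns, the same fill idea can be sketched without hard-coding Date_1 or Date_n. This is only a sketch under assumptions: it assumes every status column name starts with "Date_", it rebuilds the sample frame from the question, and it masks with isin/where instead of replace so that anything other than relationship or single counts as missing. Restricting the fill to the Date columns also keeps Client_ID from being carried into a row that has no valid status at all.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the simplified frame from the question
df = pd.DataFrame({
    "Client_ID": ["C1019876", "C1018765", "C1017654", "C1016543", "C1015432"],
    "Date_1": ["relationship", "no change", "single", np.nan, np.nan],
    "Date_2": ["no change", "single", "no change", "relationship", "no change"],
    "Date_3": ["no change", "no change", "relationship", "no change", "single"],
    "Date_4": ["no change", "no change", np.nan, "single", np.nan],
})

# Assumption: every status column name starts with "Date_"
date_cols = [c for c in df.columns if c.startswith("Date_")]
dates = df[date_cols]

# Keep only real statuses; everything else ("no change", NaN) becomes missing
statuses = dates.where(dates.isin(["relationship", "single"]))

# After a row-wise back-fill the first column holds the first valid status;
# after a row-wise forward-fill the last column holds the last valid status
df["first_status"] = statuses.bfill(axis=1).iloc[:, 0]
df["last_status"] = statuses.ffill(axis=1).iloc[:, -1]
```

Because the fills run on a separate statuses frame, the original no change entries in df stay untouched, so no fillna step is needed afterwards.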
Answer 1 (score: 0)
I suppose you could use list comprehensions if you really want to, but @yatu's solution will be faster:
# unstack and find the first column index where relationship or single occurs
first = df.unstack().groupby(level=1).apply(lambda x: (np.isin(x.values, ['relationship', 'single'])).argmax())
last = df.unstack()[::-1].groupby(level=1).apply(lambda x: (np.isin(x.values, ['relationship', 'single'])).argmax())
# list comprehension to find the index and column index pair
f_list = [x for x in enumerate(first)]
l_list = [x for x in enumerate(last)]
# list comprehension with iloc
f_val = [df.iloc[f_list[i]] for i in range(len(f_list))]
l_val = [df.loc[:, ::-1].iloc[l_list[i]] for i in range(len(l_list))]
# create columns
df['first'] = f_val
df['last'] = l_val
Client_ID Date_1 Date_2 Date_3 Date_4 \
0 C1019876 relationship no change no change no change
1 C1018765 no change single no change no change
2 C1017654 single no change relationship NaN
3 C1016543 NaN relationship no change single
4 C1015432 NaN no change single NaN
first last
0 relationship relationship
1 single single
2 single relationship
3 relationship single
4 single single
timeit result: 8 ms ± 230 µs per loop (mean ± std. dev. of 3 runs, 1000 loops each)
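A more direct list comprehension that follows the row-by-row steps from the question might look like the sketch below. It is not the code timed above, and it makes a few assumptions: the status columns all start with "Date_", the helper name first_valid is made up for illustration, and rows with no valid status get None.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the simplified frame from the question
df = pd.DataFrame({
    "Client_ID": ["C1019876", "C1018765", "C1017654", "C1016543", "C1015432"],
    "Date_1": ["relationship", "no change", "single", np.nan, np.nan],
    "Date_2": ["no change", "single", "no change", "relationship", "no change"],
    "Date_3": ["no change", "no change", "relationship", "no change", "single"],
    "Date_4": ["no change", "no change", np.nan, "single", np.nan],
})

VALID = {"relationship", "single"}

# Assumption: every status column name starts with "Date_"
date_cols = [c for c in df.columns if c.startswith("Date_")]


def first_valid(values):
    """Hypothetical helper: first entry that is a real status, else None."""
    return next((v for v in values if v in VALID), None)


# One plain Python list per row, in column order
rows = df[date_cols].values.tolist()

# Scan left to right for the first status, right to left for the last one
df["first_status"] = [first_valid(row) for row in rows]
df["last_status"] = [first_valid(row[::-1]) for row in rows]
```

Each row is scanned once in each direction, which mirrors "go to the next Date column" and "go to the previous Date column" from the question.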