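Given a DataFrame like the one below, here is what I want: looking only at the rows that hold the earliest date for each serial number, find those where Location is null and update them with a specified default value.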
import numpy as np
import pandas as pd

df = pd.DataFrame([['123456',pd.to_datetime('1/1/2019'),'Location A'],
['123456',pd.to_datetime('1/2/2019'),np.nan],
['123456',pd.to_datetime('1/3/2019'),np.nan],
['123456',pd.to_datetime('5/1/2019'),np.nan],
['654321',pd.to_datetime('2/1/2019'),'Location B'],
['654321',pd.to_datetime('2/2/2019'),'Location B'],
['654321',pd.to_datetime('2/3/2019'),'Location C'],
['112233',pd.to_datetime('3/1/2019'),np.nan],
['112233',pd.to_datetime('3/2/2019'),'Location D'],
['112233',pd.to_datetime('3/3/2019'),np.nan],
['445566',pd.to_datetime('4/1/2019'),'Location E'],
['445566',pd.to_datetime('4/2/2019'),'Location E'],
['445566',pd.to_datetime('4/3/2019'),'Location E'],
['778899',pd.to_datetime('5/1/2019'),np.nan],
['778899',pd.to_datetime('5/2/2019'),np.nan],
['778899',pd.to_datetime('5/3/2019'),np.nan],
['332211',pd.to_datetime('6/1/2019'),np.nan],
['332211',pd.to_datetime('6/2/2019'),'Location F'],
['332211',pd.to_datetime('6/3/2019'),'Location F'],
['665544',pd.to_datetime('7/1/2019'),'Location G'],
['665544',pd.to_datetime('7/2/2019'),'Location G'],
['665544',pd.to_datetime('7/3/2019'),'Location G'],
['998877',pd.to_datetime('8/1/2019'),'Location H'],
['998877',pd.to_datetime('8/2/2019'),'Location I'],
['998877',pd.to_datetime('8/2/2019'),'Location I'],
['147258',pd.to_datetime('9/1/2019'),np.nan],
['147258',pd.to_datetime('9/2/2019'),np.nan],
['147258',pd.to_datetime('9/3/2019'),'Location J']],
columns=['Serial','Date','Location'])
df
Out[498]:
Serial Date Location
0 123456 2019-01-01 Location A
1 123456 2019-01-02 NaN
2 123456 2019-01-03 NaN
3 123456 2019-05-01 NaN
4 654321 2019-02-01 Location B
5 654321 2019-02-02 Location B
6 654321 2019-02-03 Location C
7 112233 2019-03-01 NaN
8 112233 2019-03-02 Location D
9 112233 2019-03-03 NaN
10 445566 2019-04-01 Location E
11 445566 2019-04-02 Location E
12 445566 2019-04-03 Location E
13 778899 2019-05-01 NaN
14 778899 2019-05-02 NaN
15 778899 2019-05-03 NaN
16 332211 2019-06-01 NaN
17 332211 2019-06-02 Location F
18 332211 2019-06-03 Location F
19 665544 2019-07-01 Location G
20 665544 2019-07-02 Location G
21 665544 2019-07-03 Location G
22 998877 2019-08-01 Location H
23 998877 2019-08-02 Location I
24 998877 2019-08-02 Location I
25 147258 2019-09-01 NaN
26 147258 2019-09-02 NaN
27 147258 2019-09-03 Location J
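So in the example above, only rows 7, 13, 16 and 25 should be selected. I have it working with the line of code below. It is functional, but it feels clumsy and convoluted. Is there a better way?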
df.loc[pd.Series(df.index).isin(df.groupby('Serial')['Date'].idxmin().tolist()) & df['Location'].isnull(), 'Location'] = 'XXXX'
df
Out[502]:
Serial Date Location
0 123456 2019-01-01 Location A
1 123456 2019-01-02 NaN
2 123456 2019-01-03 NaN
3 123456 2019-05-01 NaN
4 654321 2019-02-01 Location B
5 654321 2019-02-02 Location B
6 654321 2019-02-03 Location C
7 112233 2019-03-01 XXXX
8 112233 2019-03-02 Location D
9 112233 2019-03-03 NaN
10 445566 2019-04-01 Location E
11 445566 2019-04-02 Location E
12 445566 2019-04-03 Location E
13 778899 2019-05-01 XXXX
14 778899 2019-05-02 NaN
15 778899 2019-05-03 NaN
16 332211 2019-06-01 XXXX
17 332211 2019-06-02 Location F
18 332211 2019-06-03 Location F
19 665544 2019-07-01 Location G
20 665544 2019-07-02 Location G
21 665544 2019-07-03 Location G
22 998877 2019-08-01 Location H
23 998877 2019-08-02 Location I
24 998877 2019-08-02 Location I
25 147258 2019-09-01 XXXX
26 147258 2019-09-02 NaN
27 147258 2019-09-03 Location J
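Edit: I added a new row 3 to the example df to clarify that dates are unique within a serial-number group but may not be unique across serial numbers. In this example, the row at index 3 has the same date as another serial's minimum date, but it should not be selected. I solved this by matching on the index rather than on the date itself, but the way I did it feels messy.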
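The index-matching idea described in the edit can be written somewhat more directly. What follows is only a minimal sketch, not a confirmed best practice: it assumes the index labels are unique, and it uses a small hypothetical stand-in for the example frame (same column names, fewer rows). The names first_idx and mask are illustrative.

```python
import numpy as np
import pandas as pd

# Small stand-in for the example frame (hypothetical data, same column names).
# Row 1 shares its date with the earliest row of serial '778899'.
df = pd.DataFrame({
    'Serial': ['123456', '123456', '778899', '778899'],
    'Date': pd.to_datetime(['1/1/2019', '5/1/2019', '5/1/2019', '5/2/2019']),
    'Location': ['Location A', np.nan, np.nan, np.nan],
})

# Index label of the earliest-dated row for each serial number.
first_idx = df.groupby('Serial')['Date'].idxmin()

# Rows that are missing a Location AND are the earliest row of their group.
mask = df['Location'].isna() & df.index.isin(first_idx)
df.loc[mask, 'Location'] = 'XXXX'
```

Because the comparison is on index labels, a row that merely shares a date with another serial's earliest row (row 1 here) is left untouched.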
Answer 0 (score: 1)
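I think your solution is "okay", but you can use numpy to make it more compact and faster.
You can use groupby(...)['Date'].min() and Series.isnull() for this.
After that, you conditionally fill the Location column with XXXX using np.where: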
min_date = df.groupby('Serial')['Date'].min()
cond = df['Location'].isnull()
df['Location'] = np.where((df['Date'].isin(min_date)) & (cond) , 'XXXX', df['Location'])
print(df)
Serial Date Location
0 123456 2019-01-01 Location A
1 123456 2019-01-02 NaN
2 123456 2019-01-03 NaN
3 654321 2019-02-01 Location B
4 654321 2019-02-02 Location B
5 654321 2019-02-03 Location C
6 112233 2019-03-01 XXXX
7 112233 2019-03-02 Location D
8 112233 2019-03-03 NaN
9 445566 2019-04-01 Location E
10 445566 2019-04-02 Location E
11 445566 2019-04-03 Location E
12 778899 2019-05-01 XXXX
13 778899 2019-05-02 NaN
14 778899 2019-05-03 NaN
15 332211 2019-06-01 XXXX
16 332211 2019-06-02 Location F
17 332211 2019-06-03 Location F
18 665544 2019-07-01 Location G
19 665544 2019-07-02 Location G
20 665544 2019-07-03 Location G
21 998877 2019-08-01 Location H
22 998877 2019-08-02 Location I
23 998877 2019-08-02 Location I
24 147258 2019-09-01 XXXX
25 147258 2019-09-02 NaN
26 147258 2019-09-03 Location J
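One caveat about the date-only isin check: it asks whether a row's date is the minimum for any serial, not for that row's own serial. The sketch below illustrates this on a small hypothetical frame (the name demo and its rows are made up for illustration, not taken from the question's data).

```python
import numpy as np
import pandas as pd

# Hypothetical two-serial frame: row 1 is NOT the earliest row of '123456',
# but its date equals the earliest date of '778899'.
demo = pd.DataFrame({
    'Serial': ['123456', '123456', '778899', '778899'],
    'Date': pd.to_datetime(['1/1/2019', '5/1/2019', '5/1/2019', '5/2/2019']),
    'Location': ['Location A', np.nan, np.nan, np.nan],
})

min_date = demo.groupby('Serial')['Date'].min()

# isin() ignores Serial, so any date that is a minimum somewhere matches.
date_only = demo['Date'].isin(min_date) & demo['Location'].isnull()

print(date_only.tolist())  # [False, True, True, False]: row 1 is flagged too
```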
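Edit after the OP's comment about duplicate dates:
We can merge the min_date DataFrame back onto df with indicator=True, so that only rows matching on both Serial and Date are marked 'both':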
min_date = df.groupby('Serial')['Date'].min().reset_index()
cond = df['Location'].isnull()
df = df.merge(min_date, on=['Serial', 'Date'], how='left', indicator=True)
df['Location'] = np.where((df['_merge'] == 'both') & (cond) , 'XXXX', df['Location'])
df = df.drop('_merge', axis=1)
print(df)
Serial Date Location
0 123456 2019-01-01 Location A
1 123456 2019-01-02 NaN
2 123456 2019-01-03 NaN
3 123456 2019-05-01 NaN
4 654321 2019-02-01 Location B
5 654321 2019-02-02 Location B
6 654321 2019-02-03 Location C
7 112233 2019-03-01 XXXX
8 112233 2019-03-02 Location D
9 112233 2019-03-03 NaN
10 445566 2019-04-01 Location E
11 445566 2019-04-02 Location E
12 445566 2019-04-03 Location E
13 778899 2019-05-01 XXXX
14 778899 2019-05-02 NaN
15 778899 2019-05-03 NaN
16 332211 2019-06-01 XXXX
17 332211 2019-06-02 Location F
18 332211 2019-06-03 Location F
19 665544 2019-07-01 Location G
20 665544 2019-07-02 Location G
21 665544 2019-07-03 Location G
22 998877 2019-08-01 Location H
23 998877 2019-08-02 Location I
24 998877 2019-08-02 Location I
25 147258 2019-09-01 XXXX
26 147258 2019-09-02 NaN
27 147258 2019-09-03 Location J
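As a possible alternative to the merge, the per-group minimum can be broadcast back onto the rows with transform and compared in place. This is a sketch under assumptions rather than part of the original answer: the frame is a small hypothetical one, and earliest and mask are illustrative names. Like the merge approach, it selects every tied row if a serial has several rows on its earliest date.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; row 1 shares its date with the earliest row of '778899'.
df = pd.DataFrame({
    'Serial': ['123456', '123456', '778899', '778899', '112233'],
    'Date': pd.to_datetime(['1/1/2019', '5/1/2019', '5/1/2019',
                            '5/2/2019', '3/1/2019']),
    'Location': ['Location A', np.nan, np.nan, np.nan, 'Location D'],
})

# Earliest date of each row's own serial, aligned to the original index.
earliest = df.groupby('Serial')['Date'].transform('min')

# Earliest row within its own group and no Location recorded.
mask = df['Date'].eq(earliest) & df['Location'].isna()
df.loc[mask, 'Location'] = 'XXXX'
```

Since the comparison is made row by row against the row's own group, no temporary _merge column is needed and the original index is preserved.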