I have the following dataset:
import numpy as np
import pandas as pd

df = pd.DataFrame({'timestamp': np.repeat(pd.date_range('2019-08-01', '2019-08-03'), 3),
                   'group': ['A', 'B', 'C', 'B', 'B', 'C', 'B', 'C', 'C'],
                   'id_appear': [np.nan, 1, np.nan, 1, 2, np.nan, 1, np.nan, np.nan]})
It looks like this:
df.sort_values('group', inplace=True)
   timestamp group  id_appear
0 2019-08-01     A        NaN
1 2019-08-01     B        NaN
3 2019-08-02     B        NaN
4 2019-08-02     B        NaN
6 2019-08-03     B        NaN
2 2019-08-01     C        NaN
5 2019-08-02     C        NaN
7 2019-08-03     C        NaN
8 2019-08-03     C        NaN
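The real dataset, however, has nearly 6 million rows.
I want to fill id_appear with a sequence of numbers from 1 to N, where N is the number of times a group appears on a given day.
I expect the following: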
   timestamp group  id_appear
0 2019-08-01     A        1.0
1 2019-08-01     B        1.0
3 2019-08-02     B        1.0
4 2019-08-02     B        2.0
6 2019-08-03     B        1.0
2 2019-08-01     C        1.0
5 2019-08-02     C        1.0
7 2019-08-03     C        1.0
8 2019-08-03     C        2.0
I tried the following code:
indexes = df.index
count = 1
saved = None
for pos, (index, row) in enumerate(df.iterrows()):
    if pos == 0 or ((row['group'] != saved['group']) or (row['timestamp'] != saved['timestamp'])):
        count = 1
    else:
        count += 1
    df.loc[index, 'id_appear'] = count
    saved = row
It works, but it is extremely inefficient. How can I make this code faster?
Thanks!
Answer 0 (score: 0)
Replace all NaN values in id_appear with some number (I used 0).
Count the rows by the group and timestamp columns into a new data frame.
The code is as follows:
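import numpy as np
import pandas as pd

df = pd.DataFrame({'timestamp': np.repeat(pd.date_range('2019-08-01', '2019-08-03'), 3),
                   'group': ['A', 'B', 'C', 'B', 'B', 'C', 'B', 'C', 'C'],
                   'id_appear': [np.nan, 1, np.nan, 1, 2, np.nan, 1, np.nan, np.nan]})
# Overwrite id_appear with 0 so that count() sees no NaN values
df['id_appear'] = 0
# Count the rows in each (group, timestamp) pair
df_ = df.groupby(by=['group', 'timestamp']).count()
# Swap the placeholder column for the per-pair counts
df.drop(['id_appear'], axis=1, inplace=True)
df = pd.merge(left=df, right=df_, how='inner', on=['timestamp', 'group'])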
This gives the output:
   timestamp group  id_appear
0 2019-08-01     A          1
1 2019-08-01     B          1
2 2019-08-01     C          1
3 2019-08-02     B          2
4 2019-08-02     B          2
5 2019-08-02     C          1
6 2019-08-03     B          1
7 2019-08-03     C          2
8 2019-08-03     C          2
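Note that this output holds the total number of rows in each (group, timestamp) pair, whereas the question asks for a running number from 1 to N inside each pair. A minimal sketch of one vectorized way to get that, assuming the rows should be numbered in their existing order, is groupby(...).cumcount(). The group_size column name is hypothetical and is only there to show the per-pair total next to the running number.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'timestamp': np.repeat(pd.date_range('2019-08-01', '2019-08-03'), 3),
    'group': ['A', 'B', 'C', 'B', 'B', 'C', 'B', 'C', 'C'],
    'id_appear': [np.nan, 1, np.nan, 1, 2, np.nan, 1, np.nan, np.nan],
})

keys = ['group', 'timestamp']

# Running number 1..N inside each (group, timestamp) pair.
# cumcount() starts at 0 and follows the current row order, hence the + 1.
df['id_appear'] = df.groupby(keys).cumcount() + 1

# Per-pair total broadcast back to every row (hypothetical column name).
# This matches what the count-and-merge approach produces, without the merge.
df['group_size'] = df.groupby(keys)['id_appear'].transform('count')

print(df.sort_values('group'))
```

This avoids both iterrows() and the merge, and it does not need the frame to be sorted first. The result is an integer column; append .astype(float) if the 1.0 / 2.0 form from the question is required.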