I have a dataset consisting of a date, a car_id, and a destination.
For each row, I want the cumulative count of unique destinations per car_id. Importantly, the counting should start from the earliest date onward.
The desired output is the "unique_destinations" column:
date car_id destination unique_destinations
0 01/01/2019 1 Boston 1
1 01/01/2019 2 Miami 1
2 02/01/2019 1 Boston 1
3 02/01/2019 2 Orlando 2
4 03/01/2019 1 New York 2
5 03/01/2019 2 Tampa 3
6 04/01/2019 1 Boston 2
7 04/01/2019 2 Miami 3
8 05/01/2019 1 Washington 3
9 05/01/2019 2 Jacksonville 4
10 06/01/2019 1 New York 3
11 06/02/2019 2 Atlanta 5
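For illustration, here is a minimal vectorized sketch (not taken from the question or any of the answers below) of the relationship being asked for: mark the first time each (car_id, destination) pair appears, then take a running sum of those first sightings within each car_id. It assumes the frame is already sorted by date, as in the sample above; the small frame built here only reproduces the first few sample rows.

import pandas as pd

# A few rows of the sample data shown above (dates are day-first).
df = pd.DataFrame({
    'date': pd.to_datetime(['01/01/2019', '01/01/2019', '02/01/2019', '02/01/2019'], dayfirst=True),
    'car_id': [1, 2, 1, 2],
    'destination': ['Boston', 'Miami', 'Boston', 'Orlando'],
})

# True on the first occurrence of each (car_id, destination) pair,
# then a cumulative sum of those flags within each car_id.
first_seen = ~df.duplicated(subset=['car_id', 'destination'])
df['unique_destinations'] = first_seen.astype(int).groupby(df['car_id']).cumsum()
print(df)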
Answer 0 (score: 0)
OK, this may not be very efficient, but it is one way to do it :)
def check(data):
    # Track destinations already seen for this car and keep a running count.
    seen = []
    flag = 0
    for index, row in data.iterrows():
        if row['destination'] not in seen:
            flag += 1
            seen.append(row['destination'])
        # .loc avoids the chained-assignment pitfall of data['col'][index] = ...
        data.loc[index, 'unique_destinations'] = flag
    return data
df['unique_destinations'] = 0
df.groupby('car_id').apply(check)
Output
0 1
1 1
2 1
3 2
4 2
5 3
6 2
7 3
8 3
9 4
10 3
11 5
Name: unique_destinations, dtype: int64
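If you would rather not rely on the groups being modified in place, a small variation (my addition, not part of the original answer) is to capture the result of the apply explicitly and assign it back:

# Sketch only: group_keys=False keeps the original row index so the
# result aligns with df when assigned back.
result = df.groupby('car_id', group_keys=False).apply(check)
df['unique_destinations'] = result['unique_destinations']
print(df['unique_destinations'])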
Answer 1 (score: 0)
We can also split the data by car_id and run a custom function on each part, like the one below:
def create_uniques(df):
    dests = []     # destinations seen so far for this car
    uniques = []   # running count, one value per row
    counter = 0
    for ix, r in df.iterrows():
        if r['destination'] not in dests:
            counter += 1
            dests.append(r['destination'])
        uniques.append(counter)
    df['unique_destinations'] = uniques
    return df
df1 = df[df['car_id'] == 1].reset_index(drop=True)
df2 = df[df['car_id'] == 2].reset_index(drop=True)
df_final = pd.concat([create_uniques(df1), create_uniques(df2)], ignore_index=True).sort_values('date')
print(df_final)
date car_id destination unique_destinations
0 2019-01-01 1 Boston 1
6 2019-01-01 2 Miami 1
1 2019-02-01 1 Boston 1
7 2019-02-01 2 Orlando 2
2 2019-03-01 1 New York 2
8 2019-03-01 2 Tampa 3
3 2019-04-01 1 Boston 2
9 2019-04-01 2 Miami 3
4 2019-05-01 1 Washington 3
10 2019-05-01 2 Jacksonville 4
5 2019-06-01 1 New York 3
11 2019-06-02 2 Atlanta 5
Timing, along with the other answers:
Erfan's answer:
Iamklaus's answer:
%%timeit
def create_uniques(df):
    dests = []
    uniques = []
    counter = 0
    for ix, r in df.iterrows():
        if r['destination'] not in dests:
            counter += 1
            dests.append(r['destination'])
            uniques.append(counter)
        else:
            uniques.append(counter)
    df['unique_destinations'] = uniques
    return df

df1 = df[df['car_id'] == 1].reset_index(drop=True)
df2 = df[df['car_id'] == 2].reset_index(drop=True)
df_final = pd.concat([create_uniques(df1), create_uniques(df2)], ignore_index=True).sort_values('date')
11 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
nikhilbalwani's answer:
%%timeit
def check(data):
    seen = []
    flag = 0
    for index, row in data.iterrows():
        if row['destination'] not in seen:
            flag += 1
            data['unique_destinations'][index] = flag
            seen.append(row['destination'])
        else:
            data['unique_destinations'][index] = flag
    return data

df['unique_destinations'] = 0
df.groupby('car_id').apply(check)
15.3 ms ± 346 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Answer 2 (score: 0)
Try this short piece of code:
df['unique_destinations'] = 0  # make sure the column exists before filling it row by row
for index, row in df.iterrows():
    # Unique destinations per car_id, counting only rows up to this row's date.
    unique_before_date = df[df['date'] <= row['date']].groupby(['car_id'])['destination'].nunique()
    df.loc[index, 'unique_destinations'] = int(unique_before_date[row['car_id']])
print(df)
It produces the following output:
date car_id destination unique_destinations
0 2019-01-01 1 Boston 1
1 2019-01-01 2 Miami 1
2 2019-01-02 1 Boston 1
3 2019-01-02 2 Orlando 2
4 2019-01-03 1 New York 2
5 2019-01-03 2 Tampa 3
6 2019-01-04 1 Boston 2
7 2019-01-04 2 Miami 3
8 2019-01-05 1 Washington 3
9 2019-01-05 2 Jacksonville 4
10 2019-01-06 1 New York 3
11 2019-02-06 2 Atlanta 5
Answer 3 (score: -1)
This assumes you want the counter to go up by 1 for every day:
import pandas as pd
import datetime as dt
df['unique_destinations'] = ((df['date']) - min(df['date'])).dt.days + 1
However, if you only need the counter to increase when a new date appears, and the dates are not necessarily consecutive, you can iterate like this:
a = 1
unique_destinations = []
for index, row in df.iterrows():
    try:
        # Increment only when the date changes from the previous row.
        if row['date'] != currentdate:
            a = a + 1
    except NameError:
        # First iteration: currentdate is not defined yet.
        pass
    unique_destinations.append(a)
    currentdate = row['date']
df['unique_destinations'] = unique_destinations
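As a side note (my addition, not part of this answer), the same "new date means increment" logic can be written without a Python loop, assuming rows with the same date sit next to each other:

# Compare each date with the previous row's date; every change starts a new step.
df['unique_destinations'] = (df['date'] != df['date'].shift()).cumsum()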