How to count the cumulative number of unique row values per ID over time

Date: 2019-04-03 12:21:49

Tags: python pandas

I have a dataset consisting of date, car_id and destination.

For each row, I want the cumulative number of unique destinations for that car_id. Importantly, the counter should start as early as possible.

The desired output is the "unique_destinations" column:

          date  car_id   destination  unique_destinations
0   01/01/2019       1        Boston                    1
1   01/01/2019       2         Miami                    1
2   02/01/2019       1        Boston                    1
3   02/01/2019       2       Orlando                    2
4   03/01/2019       1      New York                    2
5   03/01/2019       2         Tampa                    3
6   04/01/2019       1        Boston                    2
7   04/01/2019       2         Miami                    3
8   05/01/2019       1    Washington                    3
9   05/01/2019       2  Jacksonville                    4
10  06/01/2019       1      New York                    3
11  06/02/2019       2       Atlanta                    5
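(For reference, this column can be produced without any explicit loop; the following is a sketch using `DataFrame.duplicated` plus a grouped cumulative sum, with the sample data rebuilt from the table above.)

```python
import pandas as pd

# Sample data rebuilt from the table above.
df = pd.DataFrame({
    'date': ['01/01/2019', '01/01/2019', '02/01/2019', '02/01/2019',
             '03/01/2019', '03/01/2019', '04/01/2019', '04/01/2019',
             '05/01/2019', '05/01/2019', '06/01/2019', '06/02/2019'],
    'car_id': [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
    'destination': ['Boston', 'Miami', 'Boston', 'Orlando', 'New York',
                    'Tampa', 'Boston', 'Miami', 'Washington',
                    'Jacksonville', 'New York', 'Atlanta'],
})

# A row is "new" if its (car_id, destination) pair has not appeared before;
# the running count of new rows within each car_id is the desired column.
is_new = ~df.duplicated(['car_id', 'destination'])
df['unique_destinations'] = is_new.astype(int).groupby(df['car_id']).cumsum()
```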

4 Answers:

Answer 0: (score: 0)

OK, this may not be very efficient, but it is one way to do it :)

def check(data):
    seen = []
    flag = 0
    for index, row in data.iterrows():
        if row['destination'] not in seen:
            flag += 1
            seen.append(row['destination'])
        # use .loc instead of chained indexing, which may silently fail
        data.loc[index, 'unique_destinations'] = flag
    return data

df['unique_destinations'] = 0
df.groupby('car_id').apply(check)
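The row-by-row writes inside the loop are what make this slow; a sketch of the same per-group logic without `iterrows`, keeping the `check`/`groupby` shape of the answer above (assuming the question's `df`):

```python
import pandas as pd

df = pd.DataFrame({
    'car_id': [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
    'destination': ['Boston', 'Miami', 'Boston', 'Orlando', 'New York',
                    'Tampa', 'Boston', 'Miami', 'Washington',
                    'Jacksonville', 'New York', 'Atlanta'],
})

def check(data):
    # First occurrence of each destination within the group is "new";
    # the cumulative sum of the "new" flags is the running counter.
    data['unique_destinations'] = (~data['destination'].duplicated()).cumsum()
    return data

out = df.groupby('car_id', group_keys=False).apply(check).sort_index()
```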

Output

0     1
1     1
2     1
3     2
4     2
5     3
6     2
7     3
8     3
9     4
10    3
11    5
Name: unique_destinations, dtype: int64

Answer 1: (score: 0)

We can also split the data by car ID and run a custom function on each part:

def create_uniques(df):
    dests = []
    uniques = []
    counter = 0
    for ix, r in df.iterrows():
        if r['destination'] not in dests:
            counter += 1
            dests.append(r['destination'])
            uniques.append(counter)
        else:
            uniques.append(counter)

    df['unique_destinations'] = uniques

    return df

df1 = df[df['car_id'] == 1].reset_index(drop=True)
df2 = df[df['car_id'] == 2].reset_index(drop=True)

df_final = pd.concat([create_uniques(df1), create_uniques(df2)], ignore_index=True).sort_values('date')
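The two hard-coded `car_id` filters generalize to any number of cars; a sketch that applies the same `create_uniques` to every group (shown on a shortened version of the sample data):

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['01/01/2019', '01/01/2019', '02/01/2019',
             '02/01/2019', '03/01/2019', '03/01/2019'],
    'car_id': [1, 2, 1, 2, 1, 2],
    'destination': ['Boston', 'Miami', 'Boston', 'Orlando', 'New York', 'Tampa'],
})

def create_uniques(df):
    dests, uniques, counter = [], [], 0
    for ix, r in df.iterrows():
        if r['destination'] not in dests:
            counter += 1
            dests.append(r['destination'])
        uniques.append(counter)
    df['unique_destinations'] = uniques
    return df

# One reset-index copy per car instead of hard-coding df1 and df2.
parts = [create_uniques(g.reset_index(drop=True)) for _, g in df.groupby('car_id')]
df_final = pd.concat(parts, ignore_index=True).sort_values('date')
```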

Output:

print(df_final)
         date  car_id   destination  unique_destinations
0  2019-01-01       1        Boston                    1
6  2019-01-01       2         Miami                    1
1  2019-02-01       1        Boston                    1
7  2019-02-01       2       Orlando                    2
2  2019-03-01       1      New York                    2
8  2019-03-01       2         Tampa                    3
3  2019-04-01       1        Boston                    2
9  2019-04-01       2         Miami                    3
4  2019-05-01       1    Washington                    3
10 2019-05-01       2  Jacksonville                    4
5  2019-06-01       1      New York                    3
11 2019-06-02       2       Atlanta                    5

Timings against the other answers:

Iamklaus' answer:

%%timeit

def create_uniques(df):
    dests = []
    uniques = []
    counter = 0
    for ix, r in df.iterrows():
        if r['destination'] not in dests:
            counter += 1
            dests.append(r['destination'])
            uniques.append(counter)
        else:
            uniques.append(counter)

    df['unique_destinations'] = uniques

    return df

df1 = df[df['car_id'] == 1].reset_index(drop=True)
df2 = df[df['car_id'] == 2].reset_index(drop=True)

df_final = pd.concat([create_uniques(df1), create_uniques(df2)], ignore_index=True).sort_values('date')

11 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

nikhilbalwani's answer:

%%timeit

def check(data):
    seen = []
    flag = 0
    for index,row in data.iterrows():
        if row['destination'] not in seen:
            flag+=1
            data['unique_destinations'][index] = flag
            seen.append(row['destination'])
        else:
            data['unique_destinations'][index] = flag
    return data

df['unique_destinations'] = 0
df.groupby('car_id').apply(check)

15.3 ms ± 346 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Answer 2: (score: 0)

Try this short piece of code:

for index, row in df.iterrows():
    # unique destinations per car over all rows up to and including this date
    unique_before_date = df[df['date'] <= row['date']].groupby(['car_id'])['destination'].nunique()
    # use .loc instead of chained indexing, which may silently fail
    df.loc[index, 'unique_destinations'] = int(unique_before_date[row['car_id']])

print(df)

It produces the following output:

         date  car_id   destination unique_destinations
0  2019-01-01       1        Boston                   1
1  2019-01-01       2         Miami                   1
2  2019-01-02       1        Boston                   1
3  2019-01-02       2       Orlando                   2
4  2019-01-03       1      New York                   2
5  2019-01-03       2         Tampa                   3
6  2019-01-04       1        Boston                   2
7  2019-01-04       2         Miami                   3
8  2019-01-05       1    Washington                   3
9  2019-01-05       2  Jacksonville                   4
10 2019-01-06       1      New York                   3
11 2019-02-06       2       Atlanta                   5
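The scan above recomputes `nunique` for every row, which is quadratic in the number of rows. A sketch of an equivalent cumulative count using one-hot destination columns and a grouped running maximum, assuming rows are already sorted by date (shortened sample data):

```python
import pandas as pd

df = pd.DataFrame({
    'car_id': [1, 2, 1, 2, 1, 2],
    'destination': ['Boston', 'Miami', 'Boston', 'Orlando', 'New York', 'Tampa'],
})

# One indicator column per destination; cummax marks "seen so far" per car,
# and the row sum counts how many destinations that car has seen.
seen = pd.get_dummies(df['destination'], dtype=int).groupby(df['car_id']).cummax()
df['unique_destinations'] = seen.sum(axis=1)
```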

Answer 3: (score: -1)

This assumes you want to add +1 for each passing day:

import pandas as pd

# requires df['date'] to be a datetime column
df['unique_destinations'] = (df['date'] - df['date'].min()).dt.days + 1

However, if you only want the counter to increase on each new day, where the days are not necessarily consecutive, you can iterate like this:

a = 1
currentdate = None
unique_destinations = []
for index, row in df.iterrows():
    # bump the counter whenever the date changes
    if currentdate is not None and row['date'] != currentdate:
        a += 1
    unique_destinations.append(a)
    currentdate = row['date']
df['unique_destinations'] = unique_destinations
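The day-counter loop can also be written in one line with `pd.factorize`, which numbers each distinct date in order of first appearance; a sketch on hypothetical sample dates:

```python
import pandas as pd

df = pd.DataFrame({'date': ['01/01/2019', '01/01/2019', '02/01/2019',
                            '02/01/2019', '03/01/2019', '03/01/2019']})

# factorize assigns codes 0, 1, 2, ... to distinct dates in order of appearance
df['day_number'] = pd.factorize(df['date'])[0] + 1
```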