Create a column with an incrementing counter to identify duplicate groups in pandas

Date: 2018-03-09 21:05:04

Tags: python function pandas duplicates

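I have a large df in which a subset of columns holds identical values across some rows, dup_columns = ['id', 'subject','topic', 'lesson', 'time'], while other columns are unique, e.g. ['timestamps'].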

   id  subj topic lesson   timestamp   time  dup_ind  dup_group              time_diff
1   1  math   add      a  timestamp1  45sec     True          1  timestamp1-timestamp2
2   1  math   add      a  timestamp2  45sec     True          1  timestamp1-timestamp2
3   1  math   add      a  timestamp2  30sec    False        NaN                    NaN
4   1  math   add      a  timestamp3  15sec    False        NaN                    NaN
5   1  math   add      b  timestamp1   0sec     True          2  timestamp1-timestamp4
6   1  math   add      b  timestamp4   0sec     True          2  timestamp1-timestamp4
7   1  math   add      b  timestamp1  45sec     True          3  timestamp1-timestamp2
8   1  math   add      b  timestamp2  45sec     True          3  timestamp1-timestamp2

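I have a column ['is_duplicate'] that flags duplicates based on dup_columns. I need to create another column, ['dup_group'], that identifies each set of duplicated rows by assigning it a unique duplicate-group value (1, 2, 3, ...). Finally, I need dup_group so that I can compare the timestamp values within each duplicate group (I am using the .diff() method).

Here is the code I wrote: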

df2= df1.loc[df1['is_duplicated']==True]
def dup_counter():
    for name, group in df11.groupby(dup_columns):
        df[name, df['dupsetnew']]+=1
    return df['dupsetnew']

df11.groupby(dup_columns).apply(dup_counter)

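Question 1: The function gives me an error (I am new to Python and to programming).

To compute the difference between timestamps, I have the following code: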

df['time_diff'] = df.loc[df.dup_indicator == 1,'event_time'].diff()

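Question 2: Is .diff() the right method for what I need?

1 answer:

Answer 0 (score: 0)

Here is one way. Note that I have changed df['timestamp'] to a series of integers to demonstrate the principle, but this can be adapted to datetime objects.

The idea is to use pd.factorize on a list of tuples to identify the groups, then apply a forward and a backward groupby diff to get the desired result. The forward diff leaves NaN on the first row of each group, and the negated backward diff fills that gap.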

df['timestamp'] = [1, 2, 2, 3, 1, 4, 1, 2]

df['dup_group'] = pd.factorize(list(zip(df['id'], df['subj'], df['topic'], 
                                        df['lesson'], df['time'])))[0] + 1

df['time_diff'] = df.groupby('dup_group')['timestamp'].transform(pd.Series.diff)

df['time_diff'] = df['time_diff'].fillna(-df.groupby('dup_group')['timestamp']\
                                            .transform(pd.Series.diff, periods=-1))

#    id  subj topic lesson  timestamp   time  dup_ind  dup_group  time_diff
# 1   1  math   add      a          1  45sec     True          1        1.0
# 2   1  math   add      a          2  45sec     True          1        1.0
# 3   1  math   add      a          2  30sec    False          2        NaN
# 4   1  math   add      a          3  15sec    False          3        NaN
# 5   1  math   add      b          1   0sec     True          4        3.0
# 6   1  math   add      b          4   0sec     True          4        3.0
# 7   1  math   add      b          1  45sec     True          5        1.0
# 8   1  math   add      b          2  45sec     True          5        1.0
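
Note that the output above numbers every distinct combination, including rows that are not duplicated (groups 2 and 3), whereas the question asks for 1, 2, 3, ... on duplicated rows only and NaN elsewhere. The following is a minimal sketch of one way to get that, not the answer's own method: it assumes DataFrame.duplicated(keep=False) is an acceptable way to flag duplicates and uses groupby(...).ngroup() for the numbering. The names is_dup and dups are introduced here for illustration.

```python
import pandas as pd

# Small frame mirroring the example above (integer timestamps, as in the answer).
df = pd.DataFrame({
    'id': [1] * 8,
    'subj': ['math'] * 8,
    'topic': ['add'] * 8,
    'lesson': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'timestamp': [1, 2, 2, 3, 1, 4, 1, 2],
    'time': ['45sec', '45sec', '30sec', '15sec', '0sec', '0sec', '45sec', '45sec'],
}, index=range(1, 9))

dup_columns = ['id', 'subj', 'topic', 'lesson', 'time']

# True for every row whose dup_columns values occur more than once.
is_dup = df.duplicated(subset=dup_columns, keep=False)

# Number only the duplicated rows, in order of first appearance, starting at 1.
# sort=False keeps the groups in the order they are first seen.
dups = df[is_dup]
df['dup_group'] = dups.groupby(dup_columns, sort=False).ngroup() + 1

# Assignment aligns on the index, so non-duplicated rows receive NaN
# and the column becomes float.
print(df[['lesson', 'time', 'dup_group']])
```

Because the non-duplicated rows hold NaN, the resulting column is float rather than integer.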

Source data

import pandas as pd
from numpy import nan

df = pd.DataFrame({'dup_group': {1: 1.0, 2: 1.0, 3: nan, 4: nan, 5: 2.0, 6: 2.0, 7: 3.0, 8: 3.0},
 'dup_ind': {1: True, 2: True, 3: False, 4: False, 5: True, 6: True, 7: True, 8: True},
 'id': {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
 'lesson': {1: 'a', 2: 'a', 3: 'a', 4: 'a', 5: 'b', 6: 'b', 7: 'b', 8: 'b'},
 'subj': {1: 'math', 2: 'math', 3: 'math', 4: 'math', 5: 'math', 6: 'math', 7: 'math', 8: 'math'},
 'time': {1: '45sec', 2: '45sec', 3: '30sec', 4: '15sec', 5: '0sec', 6: '0sec', 7: '45sec', 8: '45sec'},
 'time_diff': {1: 'timestamp1-timestamp2', 2: 'timestamp1-timestamp2', 3: nan, 4: nan, 5: 'timestamp1-timestamp4', 6: 'timestamp1-timestamp4', 7: 'timestamp1-timestamp2', 8: 'timestamp1-timestamp2'},
 'timestamp': {1: 'timestamp1', 2: 'timestamp2', 3: 'timestamp2', 4: 'timestamp3', 5: 'timestamp1', 6: 'timestamp4', 7: 'timestamp1', 8: 'timestamp2'},
 'topic': {1: 'add', 2: 'add', 3: 'add', 4: 'add', 5: 'add', 6: 'add', 7: 'add', 8: 'add'}})
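
The answer mentions that the same idea can be adapted to datetime objects. Below is a minimal sketch under that assumption, with timestamp values invented purely for illustration and only two key columns to keep it short. It also bears on Question 2: a plain .diff() on the filtered column subtracts across group boundaries, whereas a grouped diff stays inside each dup_group.

```python
import pandas as pd

# Hypothetical timestamps, invented for illustration only.
df = pd.DataFrame({
    'lesson': ['a', 'a', 'b', 'b', 'b'],
    'time': ['45sec', '45sec', '0sec', '0sec', '30sec'],
    'timestamp': pd.to_datetime([
        '2018-03-09 21:00:00', '2018-03-09 21:00:45',
        '2018-03-09 21:05:00', '2018-03-09 21:05:10',
        '2018-03-09 21:06:00',
    ]),
})

dup_columns = ['lesson', 'time']

# Number every distinct combination of dup_columns, starting at 1.
df['dup_group'] = df.groupby(dup_columns, sort=False).ngroup() + 1

# diff() within each group, so no difference is taken across group boundaries.
grouped = df.groupby('dup_group')['timestamp']
forward = grouped.diff()               # current row minus previous row in the group
backward = -grouped.diff(periods=-1)   # next row minus current row in the group

# The first row of each group has no previous row, so use the backward value there.
df['time_diff'] = forward.fillna(backward)

print(df)
```

With datetime input the differences come out as Timedelta values, and a group with a single row keeps NaT.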