I have a large df in which a subset of the columns is identical across some rows, dup_columns = ['id', 'subject','topic', 'lesson', 'time'], while other columns are unique, e.g. ['timestamps'].
   id  subj  topic  lesson  timestamp   time   dup_ind  dup_group  time_diff
1  1   math  add    a       timestamp1  45sec  True     1          timestamp1-timestamp2
2  1   math  add    a       timestamp2  45sec  True     1          timestamp1-timestamp2
3  1   math  add    a       timestamp2  30sec  False    NaN
4  1   math  add    a       timestamp3  15sec  False    NaN
5  1   math  add    b       timestamp1  0sec   True     2          timestamp1-timestamp4
6  1   math  add    b       timestamp4  0sec   True     2          timestamp1-timestamp4
7  1   math  add    b       timestamp1  45sec  True     3          timestamp1-timestamp2
8  1   math  add    b       timestamp2  45sec  True     3          timestamp1-timestamp2
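I have a column ['is_duplicate'] that flags duplicates based on dup_columns.
I need to create another column, ['dup_group'], that identifies each set of duplicated rows by assigning it a unique duplicate-group value (1, 2, 3, ...). Finally, I need dup_group to compare the timestamp values within each duplicate group (I am using the .diff() method).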
Here is the code I wrote:
df2= df1.loc[df1['is_duplicated']==True]
def dup_counter():
    for name, group in df11.groupby(dup_columns):
        df[name, df['dupsetnew']]+=1
    return df['dupsetnew']
df11.groupby(dup_columns).apply(dup_counter)
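Question 1: the function raises an error (I am new to Python and to programming).
To compute the timestamp differences, I have the following code: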
df['time_diff'] = df.loc[df.dup_indicator == 1,'event_time'].diff()
Question 2: is .diff() the right method for what I need?
Answer 0 (score: 0)
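Here is one way. Note that I have changed df['timestamp'] to a series of integers to demonstrate the principle, but the same approach can be adapted to datetime objects.
The idea is to use pd.factorize on a list of tuples to identify the groups, then apply a forward and a backward groupby diff to get the desired result.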
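# Integer stand-ins for the placeholder timestamps.
df['timestamp'] = [1, 2, 2, 3, 1, 4, 1, 2]
# Label each distinct key combination 1, 2, 3, ... in order of first appearance.
df['dup_group'] = pd.factorize(list(zip(df['id'], df['subj'], df['topic'],
                                        df['lesson'], df['time'])))[0] + 1
# Forward difference within each group (NaN for the first row of a group).
df['time_diff'] = df.groupby('dup_group')['timestamp'].transform(pd.Series.diff)
# Fill those NaNs with the negated backward difference so both rows of a pair match.
df['time_diff'] = df['time_diff'].fillna(-df.groupby('dup_group')['timestamp']\
                                         .transform(pd.Series.diff, periods=-1))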
#    id  subj  topic  lesson  timestamp  time   dup_ind  dup_group  time_diff
# 1  1   math  add    a       1          45sec  True     1          1.0
# 2  1   math  add    a       2          45sec  True     1          1.0
# 3  1   math  add    a       2          30sec  False    2          NaN
# 4  1   math  add    a       3          15sec  False    3          NaN
# 5  1   math  add    b       1          0sec   True     4          3.0
# 6  1   math  add    b       4          0sec   True     4          3.0
# 7  1   math  add    b       1          45sec  True     5          1.0
# 8  1   math  add    b       2          45sec  True     5          1.0
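The output above numbers every distinct key combination, so the two non-duplicated rows also receive a group (2 and 3). If only duplicated rows should be numbered, as in the question's expected table, one possible sketch is to mask with DataFrame.duplicated(keep=False) first and number the remaining groups with groupby(...).ngroup(). This swaps pd.factorize for ngroup, and the small frame and the name key_cols are illustrative assumptions rather than part of the answer.

```python
import pandas as pd

# Illustrative data mirroring the question's table (timestamps as integers).
df = pd.DataFrame({
    'id': [1] * 8,
    'subj': ['math'] * 8,
    'topic': ['add'] * 8,
    'lesson': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'time': ['45sec', '45sec', '30sec', '15sec', '0sec', '0sec', '45sec', '45sec'],
    'timestamp': [1, 2, 2, 3, 1, 4, 1, 2],
}, index=range(1, 9))

key_cols = ['id', 'subj', 'topic', 'lesson', 'time']  # assumed key columns

# True for every row whose key combination occurs more than once.
is_dup = df.duplicated(subset=key_cols, keep=False)

# Number only the duplicated rows, in order of first appearance (1, 2, 3, ...).
# Assignment aligns on the index, so all other rows are left as NaN.
df['dup_group'] = df[is_dup].groupby(key_cols, sort=False).ngroup() + 1

# Difference to the previous row within each duplicate group only.
df['time_diff'] = df[is_dup].groupby('dup_group')['timestamp'].diff()

print(df[['lesson', 'time', 'timestamp', 'dup_group', 'time_diff']])
```

Here time_diff is only the forward difference, so the first row of each group stays NaN; the backward fill from the answer can be applied on top if both rows of a pair should carry the same value.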
Source data
import pandas as pd
from numpy import nan
df = pd.DataFrame({'dup_group': {1: 1.0, 2: 1.0, 3: nan, 4: nan, 5: 2.0, 6: 2.0, 7: 3.0, 8: 3.0},
'dup_ind': {1: True, 2: True, 3: False, 4: False, 5: True, 6: True, 7: True, 8: True},
'id': {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
'lesson': {1: 'a', 2: 'a', 3: 'a', 4: 'a', 5: 'b', 6: 'b', 7: 'b', 8: 'b'},
'subj': {1: 'math', 2: 'math', 3: 'math', 4: 'math', 5: 'math', 6: 'math', 7: 'math', 8: 'math'},
'time': {1: '45sec', 2: '45sec', 3: '30sec', 4: '15sec', 5: '0sec', 6: '0sec', 7: '45sec', 8: '45sec'},
'time_diff': {1: 'timestamp1-timestamp2', 2: 'timestamp1-timestamp2', 3: nan, 4: nan, 5: 'timestamp1-timestamp4', 6: 'timestamp1-timestamp4', 7: 'timestamp1-timestamp2', 8: 'timestamp1-timestamp2'},
'timestamp': {1: 'timestamp1', 2: 'timestamp2', 3: 'timestamp2', 4: 'timestamp3', 5: 'timestamp1', 6: 'timestamp4', 7: 'timestamp1', 8: 'timestamp2'},
'topic': {1: 'add', 2: 'add', 3: 'add', 4: 'add', 5: 'add', 6: 'add', 7: 'add', 8: 'add'}})
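The source data only holds placeholder strings such as 'timestamp1', which is why the answer substituted integers. With real datetime values the same grouped diff should yield Timedelta results. The sketch below assumes made-up event times and an already computed dup_group column; both are hypothetical and only serve to illustrate the datetime case.

```python
import pandas as pd

# Hypothetical event times; the original data contains only placeholders.
df = pd.DataFrame({
    'dup_group': [1, 1, 2, 2],
    'timestamp': pd.to_datetime([
        '2018-01-01 10:00:00', '2018-01-01 10:00:45',
        '2018-01-01 11:00:00', '2018-01-01 11:03:00',
    ]),
})

grouped = df.groupby('dup_group')['timestamp']

# Forward difference: the first row of each group has no predecessor, so it is NaT.
forward = grouped.diff()

# Negated backward difference: defined for every row except the last of each group.
backward = -grouped.diff(periods=-1)

# Combine the two so both rows of a pair carry the same gap.
df['time_diff'] = forward.fillna(backward)

print(df)
```

For groups with more than two rows, this gives each row the gap to its predecessor (and the first row the gap to its successor), which may or may not be the comparison that is wanted.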