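I have a very large dataset containing the members of each team for each month. I want to find the additions and deletions for each team. Because my dataset is very large, I am trying to use built-in functions as much as possible.
My dataset looks like this: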
month team members
0 0 A X, Y, Z
1 1 A X, Y
2 2 A W, X, Y
3 0 B D, E
4 1 B D, E, F
5 2 B F
It is generated by the following code:
import pandas as pd

num_months = 3
num_teams = 2
obs = num_months * num_teams
df = pd.DataFrame({"month": [i % num_months for i in range(obs)],
                   "team": ['AB'[i // num_months] for i in range(obs)],
                   "members": ["X, Y, Z", "X, Y", "W, X, Y", "D, E", "D, E, F", "F"]})
df
The result should look like this:
month team members additions deletions
0 0 A X, Y, Z None None
1 1 A X, Y None Z
2 2 A W, X, Y W None
3 0 B D, E None None
4 1 B D, E, F F None
5 2 B F None D, E
Or, as Python code:
df = pd.DataFrame({"month": [i % num_months for i in range(obs)],
"team": ['AB'[i // num_months] for i in range(obs)],
"members": ["X, Y, Z", "X, Y", "W, X, Y", "D, E", "D, E, F", "F"],
"additions": [None, None, "W", None, "F", None],
"deletions": [None, "Z", None, None, None, "D, E"]
})
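One technique that comes to mind immediately is to create a new column holding the lagged value of members within each group, and then take the set difference between the two columns (in both directions).
Is there a way to compute the set difference between columns using pandas built-in functions?
Are there other techniques I should try?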
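The lagged-column idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper `diff_or_none` and the intermediate names `current` and `previous` are hypothetical, the rows are assumed to be ordered by month within each team, and members are assumed to be separated by a comma followed by a space.

```python
import pandas as pd

df = pd.DataFrame({
    "month": [0, 1, 2, 0, 1, 2],
    "team": ["A", "A", "A", "B", "B", "B"],
    "members": ["X, Y, Z", "X, Y", "W, X, Y", "D, E", "D, E, F", "F"],
})

# Parse each comma-separated string into a set of member names.
current = df["members"].str.split(", ").map(set)
# Lagged value of members within each team (missing for a team's first month).
previous = current.groupby(df["team"]).shift()


def diff_or_none(a, b):
    """Hypothetical helper: a - b as a sorted string, or None if empty or undefined."""
    if not isinstance(a, set) or not isinstance(b, set):
        return None  # no previous month to compare against
    return ", ".join(sorted(a - b)) or None


# Set difference in both directions, row by row.
additions = [diff_or_none(c, p) for c, p in zip(current, previous)]
deletions = [diff_or_none(p, c) for c, p in zip(current, previous)]

df["additions"] = additions
df["deletions"] = deletions
```

The set difference itself still runs in a Python-level loop, so this sketch may not be fast enough for a very large dataset; only the lagging step uses a pandas built-in.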
Answer 0 (score: 5)
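Use set, groupby, apply and shift.
members is converted to the set type, because - is an unsupported operand for the original strings and would raise a TypeError.
additions and deletions are left as the set type.
Using apply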
91.4 ms ± 2.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
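# clean the members column: strip spaces, split on commas, convert each list to a set
df.members = df.members.str.replace(' ', '').str.split(',').map(set)
# create del and add
# group_keys=False keeps the original row index so the result aligns with df (needed on pandas 2.0 and later)
df['deletions'] = df.groupby('team', group_keys=False)['members'].apply(lambda x: x.shift() - x)
df['additions'] = df.groupby('team', group_keys=False)['members'].apply(lambda x: x - x.shift())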
# result
month team members additions deletions
0 A {Z, X, Y} NaN NaN
1 A {X, Y} {} {Z}
2 A {W, X, Y} {W} {}
0 B {D, E} NaN NaN
1 B {D, F, E} {F} {}
2 B {F} {} {D, E}
Using pandas.DataFrame.diff
60.7 ms ± 3.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
df['deletions'] = df.groupby('team')['members'].diff(periods=-1).shift()
df['additions'] = df.groupby('team')['members'].diff()
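Both variants leave additions and deletions as sets, whereas the question asks for comma-separated strings with None where nothing changed. A minimal sketch of that final formatting step follows; the helper name `set_to_text` is hypothetical, and the sample column is built by hand to mirror the deletions output shown earlier.

```python
import pandas as pd


def set_to_text(value):
    """Format a set as 'A, B'; return None for empty sets and missing values."""
    if isinstance(value, (set, frozenset)) and value:
        return ", ".join(sorted(value))
    return None


# Hand-built example column shaped like the `deletions` result.
deletions = pd.Series(
    [float("nan"), {"Z"}, set(), float("nan"), set(), {"D", "E"}],
    dtype=object,
)

# Convert every entry; sorting makes the member order deterministic.
formatted = [set_to_text(v) for v in deletions]
```

The same comprehension could be run over the additions column and the resulting lists assigned back to the DataFrame.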
Answer 1 (score: 0)
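Here is one approach. I don't know whether it is the most efficient. I find that optimizing pandas performance just by looking at the code is not that straightforward.
The strategy I took is to compute the deletions and additions separately, and then merge that information back into the original DataFrame.
This solution assumes that the input DataFrame is sorted by (team, month). If it is not, it needs to be sorted first.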
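If the input is not already ordered, the sorting step could look something like this. The shuffled frame below is made up purely for illustration, and `df_sorted` is a hypothetical name.

```python
import pandas as pd

# Same data as in the question, but deliberately out of order.
df = pd.DataFrame({
    "month": [1, 0, 2, 0, 2, 1],
    "team": ["B", "A", "A", "B", "B", "A"],
    "members": ["D, E, F", "X, Y, Z", "W, X, Y", "D, E", "F", "X, Y"],
})

# Sort by team, then month, and rebuild a clean 0..n-1 index.
df_sorted = df.sort_values(["team", "month"]).reset_index(drop=True)
```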
def set_diff_adds(x):
    # map each month (from the second one on) to the members added since the previous month
    retval = {}
    for m, b, a in zip(x.month.iloc[1:], x.members.iloc[1:], x.members):
        retval[m] = (set(b.replace(' ', '').split(',')) -
                     set(a.replace(' ', '').split(',')))
    return retval

def set_diff_dels(x):
    # map each month (from the second one on) to the members removed since the previous month
    retval = {}
    for m, b, a in zip(x.month.iloc[1:], x.members.iloc[1:], x.members):
        retval[m] = (set(a.replace(' ', '').split(',')) -
                     set(b.replace(' ', '').split(',')))
    return retval

deletions = df.groupby('team').apply(set_diff_dels).apply(pd.Series)
deletions.columns.set_names('month', inplace=True)
deletions = deletions.stack().to_frame('deletions').reset_index()
merged = df.merge(deletions, how='outer')
additions = df.groupby('team').apply(set_diff_adds).apply(pd.Series)
additions.columns.set_names('month', inplace=True)
additions = additions.stack().to_frame('additions').reset_index()
merged = merged.merge(additions, how='outer')
merged
month team members deletions additions
0 0 A X, Y, Z NaN NaN
1 1 A X, Y {Z} {}
2 2 A W, X, Y {} {W}
3 0 B D, E NaN NaN
4 1 B D, E, F {} {F}
5 2 B F {D, E} {}