My goal is the output below.
A | B | C | D | E | F |
---|---|---|---|---|---|
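0000 | ZZZ | 987 | QW1 | 8 | first three-four col and offset |
0000 | ZZZ | 987 | QW1 | -8 | first three-four col and offset |
1111 | AAA | 123 | AB1 | 1 | first three-four col and offset |
1111 | AAA | 123 | CD1 | -1 | first three-four col and offset |
2222 | BBB | 456 | EF1 | -4 | first three-four col and offset |
2222 | BBB | 456 | GH1 | -1 | first three-four col and offset |
2222 | BBB | 456 | IL1 | 5 | first three-four col and offset |
3333 | CCC | 789 | MN1 | 2 | first two col and offset |
3333 | CCC | 101 | MN1 | -2 | first two col and offset |
4444 | DDD | 121 | UYT | 6 | first two col and offset |
4444 | DDD | 131 | FB1 | -5 | first two col and offset |
4444 | DDD | 141 | UYT | -1 | first two col and offset |
5555 | EEE | 151 | CB1 | 3 | first two col and offset |
5555 | EEE | 161 | CR1 | -3 | first two col and offset |
6666 | FFF | 111 | CB1 | 4 | first or no match |
7777 | GGG | 222 | ZB1 | 10.5 | first three-four col and small offset |
7777 | GGG | 222 | ZB1 | -10 | first three-four col and small offset |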
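Rule 1) The first three columns must be equal across the records, regardless of the fourth column, which may or may not be equal. Within each combination, the associated numbers (col E) must offset to zero (2 to X records can be combined).
Rule 2) The first two columns must be equal across the records, regardless of the fourth column, which may or may not be equal. Within each combination, the associated numbers (col E) must offset to zero (2 to X records can be combined).
Rule 3) No match.
Rule 4) The first three columns must be equal across the records, regardless of the fourth column, which may or may not be equal. Within each combination, the numbers (col E) may differ by AT MOST 0.5 and do not offset to zero (2 to X records can be combined).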
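To make the rules concrete, the sketch below tabulates the sum of column E per key for four of the groups in the table (2222, 4444, 6666 and 7777). It assumes the data is held in a pandas DataFrame called df; the names sum_abc and sum_ab are made up for this illustration and do not come from the original code.

```python
import pandas as pd

# A subset of the rows from the table above (columns A, B, C, E only).
rows = [
    ("2222", "BBB", 456, -4.0),
    ("2222", "BBB", 456, -1.0),
    ("2222", "BBB", 456, 5.0),
    ("4444", "DDD", 121, 6.0),
    ("4444", "DDD", 131, -5.0),
    ("4444", "DDD", 141, -1.0),
    ("6666", "FFF", 111, 4.0),
    ("7777", "GGG", 222, 10.5),
    ("7777", "GGG", 222, -10.0),
]
df = pd.DataFrame(rows, columns=["A", "B", "C", "E"])

# Rules 1 and 4 look at the sum of E per (A, B, C) combination.
sum_abc = df.groupby(["A", "B", "C"])["E"].sum()
# Rule 2 looks at the sum of E per (A, B) combination.
sum_ab = df.groupby(["A", "B"])["E"].sum()

print(sum_abc)
print(sum_ab)
```

In this subset, 2222/BBB/456 sums to 0 (rule 1), 4444/DDD sums to 0 only once column C is ignored (rule 2), 7777/GGG/222 sums to 0.5 (rule 4), and 6666/FFF sums to 4, so it matches nothing (rule 3).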
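Please see my code below.
I am fully aware that I have not written this in the most efficient way. Could you suggest a more efficient way to achieve this?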
    for i in range(0, len(df)-1):
        for j in range(i+1, len(df)):
            if (df['A'][i] == df['A'][j]) & (df['B'][i] == df['B'][j]) & (df['C'][i] == df['C'][j]) & (df['E'][i] + df['E'][j] == 0):
                df['E'][i] = 'first three-four col and offset'
                df['E'][j] = 'first three-four col and offset'
    for i in range(0, len(df)-1):
        for j in range(i+1, len(df)):
            if (df['A'][i] == df['A'][j]) & (df['B'][i] == df['B'][j]) & (df['E'][i] + df['E'][j] == 0) & (df['E'][i] != 'first three-four col and offset') & (df['E'][j] != 'first three-four col and offset'):
                df['E'][i] = 'first two col and offset'
                df['E'][j] = 'first two col and offset'
    for i in range(0, len(df)-1):
        for j in range(i+1, len(df)):
            if (df['A'][i] == df['A'][j]) & (df['B'][i] == df['B'][j]) & (df['C'][i] == df['C'][j]) & (df['E'][i] + df['E'][j] != 0) & (df['E'][i] + df['E'][j] <= 0.5) & (df['E'][i] + df['E'][j] >= -0.5) & (df['E'][i] != 'first three-four col and offset') & (df['E'][j] != 'first three-four col and offset') & (df['E'][i] != 'first two col and offset') & (df['E'][j] != 'first two col and offset'):
                df['E'][i] = 'first three-four col and small offset'
                df['E'][j] = 'first three-four col and small offset'
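Is there a more efficient way to get the expected result?
I also know that the following code does not work. I tried to update the record with the correct comment, but to no avail.
    for ... :
        if .... :
            df['col'][index] = 'comment'
Let us further assume that I want to keep my code in this "not so efficient" form, which seems to work (except for the last line of code). How should I change the last line to make my script work?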
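On that last point: df['col'][index] = 'comment' is chained indexing, so pandas may write to a temporary copy instead of df (and with Copy-on-Write enabled it never updates df). A minimal sketch of the usual alternative, a single df.loc[row, column] assignment, follows. It assumes the default integer index, so that i and j are also row labels, and it writes the comment into column F rather than E so that E stays numeric. The two-row frame is hypothetical and exists only to show the pattern.

```python
import pandas as pd

# Hypothetical two-row frame, only to illustrate the assignment pattern.
df = pd.DataFrame({"A": ["0000", "0000"], "E": [8.0, -8.0]})
df["F"] = "first or no match"  # default comment, kept in its own column

for i in range(len(df) - 1):
    for j in range(i + 1, len(df)):
        if df.loc[i, "A"] == df.loc[j, "A"] and df.loc[i, "E"] + df.loc[j, "E"] == 0:
            # A single .loc call with (row label, column label) writes to df itself.
            df.loc[i, "F"] = "first three-four col and offset"
            df.loc[j, "F"] = "first three-four col and offset"

print(df)
```

If the index is not the default 0..n-1 range, df.iloc with integer positions (or df.at with labels) would be the analogous choice.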
Answer 0 (score: 3)
Use groupby + transform and np.select:
    import numpy as np

    m1 = df.groupby(['A', 'B', 'C'])['E'].transform('sum').eq(0)          # Rule 1
    m2 = df.groupby(['A', 'B'])['E'].transform('sum').eq(0)               # Rule 2
    m3 = df.groupby(['A', 'B', 'C'])['E'].transform('sum').abs().le(0.5)  # Rule 4
    # np.select takes the first condition that is True, so m3 only applies where m1 and m2 are False.
    df['new'] = np.select([m1, m2, m3], ['first three-four col and offset',
        'first two col and offset', 'first three-four col and small offset'], 'first or no match')
        A     B    C    D        E  F                                      new
     0  0000  ZZZ  987  QW1    8.0  first three-four col and offset        first three-four col and offset
     1  0000  ZZZ  987  QW1   -8.0  first three-four col and offset        first three-four col and offset
     2  1111  AAA  123  AB1    1.0  first three-four col and offset        first three-four col and offset
     3  1111  AAA  123  CD1   -1.0  first three-four col and offset        first three-four col and offset
     4  2222  BBB  456  EF1   -4.0  first three-four col and offset        first three-four col and offset
     5  2222  BBB  456  GH1   -1.0  first three-four col and offset        first three-four col and offset
     6  2222  BBB  456  IL1    5.0  first three-four col and offset        first three-four col and offset
     7  3333  CCC  789  MN1    2.0  first two col and offset               first two col and offset
     8  3333  CCC  101  MN1   -2.0  first two col and offset               first two col and offset
     9  4444  DDD  121  UYT    6.0  first two col and offset               first two col and offset
    10  4444  DDD  131  FB1   -5.0  first two col and offset               first two col and offset
    11  4444  DDD  141  UYT   -1.0  first two col and offset               first two col and offset
    12  5555  EEE  151  CB1    3.0  first two col and offset               first two col and offset
    13  5555  EEE  161  CR1   -3.0  first two col and offset               first two col and offset
    14  6666  FFF  111  CB1    4.0  first or no match                      first or no match
    15  7777  GGG  222  ZB1   10.5  first three-four col and small offset  first three-four col and small offset
    16  7777  GGG  222  ZB1  -10.0  first three-four col and small offset  first three-four col and small offset
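For anyone who wants to reproduce this, here is a self-contained sketch that rebuilds the sample data and checks the computed column against the expected column F. The constants R1, R2, R4 and NO_MATCH and the variables sum_abc and sum_ab are names introduced here for brevity; column D is left out because none of the rules use it.

```python
import numpy as np
import pandas as pd

R1 = "first three-four col and offset"
R2 = "first two col and offset"
R4 = "first three-four col and small offset"
NO_MATCH = "first or no match"

# Sample data from the question: A, B, C, E and the expected comment F.
rows = [
    ("0000", "ZZZ", 987, 8.0, R1), ("0000", "ZZZ", 987, -8.0, R1),
    ("1111", "AAA", 123, 1.0, R1), ("1111", "AAA", 123, -1.0, R1),
    ("2222", "BBB", 456, -4.0, R1), ("2222", "BBB", 456, -1.0, R1),
    ("2222", "BBB", 456, 5.0, R1),
    ("3333", "CCC", 789, 2.0, R2), ("3333", "CCC", 101, -2.0, R2),
    ("4444", "DDD", 121, 6.0, R2), ("4444", "DDD", 131, -5.0, R2),
    ("4444", "DDD", 141, -1.0, R2),
    ("5555", "EEE", 151, 3.0, R2), ("5555", "EEE", 161, -3.0, R2),
    ("6666", "FFF", 111, 4.0, NO_MATCH),
    ("7777", "GGG", 222, 10.5, R4), ("7777", "GGG", 222, -10.0, R4),
]
df = pd.DataFrame(rows, columns=["A", "B", "C", "E", "F"])

# transform('sum') returns the group total aligned to every original row.
sum_abc = df.groupby(["A", "B", "C"])["E"].transform("sum")
sum_ab = df.groupby(["A", "B"])["E"].transform("sum")

# Conditions are checked in order, so rule 1 beats rule 2, which beats rule 4.
df["new"] = np.select(
    [sum_abc.eq(0), sum_ab.eq(0), sum_abc.abs().le(0.5)],
    [R1, R2, R4],
    default=NO_MATCH,
)

print((df["new"] == df["F"]).all())  # prints True
```

One caveat worth keeping in mind: eq(0) is an exact comparison. It holds for this sample, but if column E contains decimals that floating point cannot represent exactly, comparing with a tolerance (for example np.isclose) may be the safer choice.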