Here is the data -
Account_Number Dummy_Account
1050080713252 ACC0000000000001
1050223213427 ACC0000000000002
1050080713252 ACC0000000169532
1105113502309 ACC0000000123005
1100043521537 ACC0000000000004
1100045301840 ACC0000000000005
1105113502309 ACC0000000000040
Rows 1 and 3 have a duplicate value in Account_Number, and so do rows 4 and 7. I need to replace the duplicated values in Account_Number with the corresponding value from Dummy_Account, so for 1050080713252 both rows 1 and 3 should get the same dummy value ACC0000000000001. But rather than replacing in place, I want to keep the original mapping.
My expected output is -
Account_Number_Map Dummy_Account_Original
ACC0000000000001 ACC0000000000001
ACC0000000000002 ACC0000000000002
ACC0000000000001 ACC0000000169532
ACC0000000123005 ACC0000000123005
ACC0000000000004 ACC0000000000004
ACC0000000000005 ACC0000000000005
ACC0000000123005 ACC0000000000040
Since ACC0000000169532 is a duplicate Dummy_Account with respect to Account_Number, I want to create a lookup that replaces it with ACC0000000000001.
What I have tried
I started by creating a dict like this -
maps = dict(zip(df.Dummy_Account, df.Account_Number))
I want a dict whose keys are the original Dummy_Account values and whose values are the new Dummy_Account values, but I am a bit stuck. My dataset is large, so I am also looking for an optimized solution.
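For reference, the dict idea sketched above could be finished like this. This is only an illustrative sketch (the frame construction and the helper names first_seen and lookup are mine, not from the question): first record the first Dummy_Account seen per Account_Number, then build the old-dummy-to-new-dummy lookup the question asks for.

```python
import pandas as pd

df = pd.DataFrame({
    'Account_Number': [1050080713252, 1050223213427, 1050080713252,
                       1105113502309, 1100043521537, 1100045301840,
                       1105113502309],
    'Dummy_Account': ['ACC0000000000001', 'ACC0000000000002', 'ACC0000000169532',
                      'ACC0000000123005', 'ACC0000000000004', 'ACC0000000000005',
                      'ACC0000000000040'],
})

# Keep only the first Dummy_Account seen for each Account_Number.
first_seen = {}
for acct, dummy in zip(df.Account_Number, df.Dummy_Account):
    first_seen.setdefault(acct, dummy)

# Build the requested lookup: original dummy -> first dummy of its account.
lookup = {dummy: first_seen[acct]
          for acct, dummy in zip(df.Account_Number, df.Dummy_Account)}

df['Account_Number_Map'] = df.Dummy_Account.map(lookup)
```

A plain Python loop like this works but will be slower than the vectorized answers below on a large dataset.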
Answer 0 (score: 4)
Option 1
I would use groupby and transform with 'first'. transform broadcasts the first value encountered to every row of its group.
df.assign(
    Account_Number=
        df.groupby('Account_Number')
          .Dummy_Account
          .transform('first')
)
Account_Number Dummy_Account
0 ACC0000000000001 ACC0000000000001
1 ACC0000000000002 ACC0000000000002
2 ACC0000000000001 ACC0000000169532
3 ACC0000000123005 ACC0000000123005
4 ACC0000000000004 ACC0000000000004
5 ACC0000000000005 ACC0000000000005
6 ACC0000000123005 ACC0000000000040
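To see the broadcasting that transform('first') performs, here is a toy sketch (the frame and column names k and v are illustrative, not from the answer):

```python
import pandas as pd

s = pd.DataFrame({'k': ['a', 'b', 'a', 'b'], 'v': [1, 2, 3, 4]})
# 'first' picks the first v in each group of k, then broadcasts it
# back onto every row of that group (unlike agg, which collapses rows).
out = s.groupby('k').v.transform('first')
print(out.tolist())  # [1, 2, 1, 2]
```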
Option 2
Use NumPy's np.unique to get the first-occurrence indices and the inverse index. The indices (idx) identify where each unique 'Account_Number' first occurs; I use them to slice 'Dummy_Account'. The inverse array (inv) is meant for putting the unique values back in place; I use it instead to place the values from the parallel array at those same positions.
u, idx, inv = np.unique(
    df.Account_Number.values,
    return_index=True,
    return_inverse=True
)

df.assign(
    Account_Number=df.Dummy_Account.values[idx][inv]
)
Account_Number Dummy_Account
0 ACC0000000000001 ACC0000000000001
1 ACC0000000000002 ACC0000000000002
2 ACC0000000000001 ACC0000000169532
3 ACC0000000123005 ACC0000000123005
4 ACC0000000000004 ACC0000000000004
5 ACC0000000000005 ACC0000000000005
6 ACC0000000123005 ACC0000000000040
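A toy sketch of what the three return values of np.unique look like (the array contents are illustrative):

```python
import numpy as np

a = np.array(['x', 'z', 'x', 'y'])
u, idx, inv = np.unique(a, return_index=True, return_inverse=True)
# u   -> sorted unique values:                        ['x', 'y', 'z']
# idx -> index of each unique value's first use in a: [0, 3, 1]
# inv -> for each element of a, its position in u:    [0, 2, 0, 1]
# So a[idx][inv] reconstructs a, and b[idx][inv] pushes a parallel
# array b through the same first-occurrence structure.
```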
Option 3
Or use pd.factorize with pd.Series.duplicated. This is similar to option 2, but I let duplicated play the role of identifying where the first values occur. I then slice the parallel values with the resulting boolean array and index into that with the result of pd.factorize; f plays the same role as inv in option 2.
d = ~df.Account_Number.duplicated().values
f, u = pd.factorize(df.Account_Number.values)

df.assign(
    Account_Number=df.Dummy_Account.values[d][f]
)
Account_Number Dummy_Account
0 ACC0000000000001 ACC0000000000001
1 ACC0000000000002 ACC0000000000002
2 ACC0000000000001 ACC0000000169532
3 ACC0000000123005 ACC0000000123005
4 ACC0000000000004 ACC0000000000004
5 ACC0000000000005 ACC0000000000005
6 ACC0000000123005 ACC0000000000040
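Why this pairing works can be seen on a toy array (illustrative values): pd.factorize numbers the uniques in order of first appearance, so its codes index directly into any array sliced by the first-occurrence mask.

```python
import pandas as pd

s = pd.Series(['x', 'z', 'x', 'y'])
d = ~s.duplicated().values     # first-occurrence mask: [True, True, False, True]
f, u = pd.factorize(s.values)  # codes by first appearance: [0, 1, 0, 2]
# s.values[d] keeps the first occurrences in appearance order,
# so indexing it with f reconstructs the full sequence.
print(s.values[d][f].tolist())  # ['x', 'z', 'x', 'y']
```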
Timing test
Results
res.plot(loglog=True)
res.div(res.min(1), 0)
jez cold pir1 pir2 pir3
10 4.524811 3.819322 2.870916 1.000000 1.238144
30 4.833144 4.093932 3.310285 1.000000 1.382189
100 4.863337 4.048008 3.146154 1.000000 1.320060
300 5.144460 3.894850 3.157636 1.000000 1.357779
1000 4.870499 3.742524 2.348021 1.000000 1.069559
3000 5.375105 3.432398 1.852771 1.126024 1.000000
10000 7.100372 4.335100 1.890134 1.551161 1.000000
30000 7.227139 3.993985 1.530002 1.594531 1.000000
100000 8.052324 3.811728 1.380440 1.708170 1.000000
300000 8.690613 4.204664 1.539624 1.942090 1.000000
1000000 7.787494 3.668117 1.498758 2.129085 1.000000
Setup
from timeit import timeit

import numpy as np
import pandas as pd

def jez(d):
    v = d.sort_values('Account_Number')
    v['Account_Number'] = v['Dummy_Account'].mask(v.duplicated('Account_Number')).ffill()
    return v.sort_index()

def cold(d):
    m = d.drop_duplicates('Account_Number', keep='first')\
         .set_index('Account_Number')\
         .Dummy_Account
    return d.assign(Account_Number=d.Account_Number.map(m))

def pir1(d):
    return d.assign(
        Account_Number=
            d.groupby('Account_Number')
             .Dummy_Account
             .transform('first')
    )

def pir2(d):
    u, idx, inv = np.unique(
        d.Account_Number.values,
        return_index=True,
        return_inverse=True
    )
    return d.assign(
        Account_Number=d.Dummy_Account.values[idx][inv]
    )

def pir3(d):
    p = ~d.Account_Number.duplicated().values
    f, u = pd.factorize(d.Account_Number.values)
    return d.assign(
        Account_Number=d.Dummy_Account.values[p][f]
    )

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000,
           30000, 100000, 300000, 1000000],
    columns='jez cold pir1 pir2 pir3'.split(),
    dtype=float
)

np.random.seed([3, 1415])
for i in res.index:
    d = pd.DataFrame(dict(
        Account_Number=np.random.randint(i // 2, size=i),
        Dummy_Account=range(i)
    ))
    for j in res.columns:
        stmt = f'{j}(d)'
        setp = f'from __main__ import {j}, d'
        res.at[i, j] = timeit(stmt, setp, number=100)
Answer 1 (score: 3)
Use sort_values and replace the duplicated values via ffill:
df = df.sort_values('Account_Number')
df['Account_Number'] = df['Dummy_Account'].mask(df.duplicated('Account_Number')).ffill()
df = df.sort_index()
print (df)
Account_Number Dummy_Account
0 ACC0000000000001 ACC0000000000001
1 ACC0000000000002 ACC0000000000002
2 ACC0000000000001 ACC0000000169532
3 ACC0000000123005 ACC0000000123005
4 ACC0000000000004 ACC0000000000004
5 ACC0000000000005 ACC0000000000005
6 ACC0000000123005 ACC0000000000040
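The mask/ffill step can be seen on its own with toy values (the frame below is illustrative): mask turns the values on duplicated keys into NaN, and ffill copies each group's first value down, which is why the frame must be sorted by the key first.

```python
import pandas as pd

v = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': ['A1', 'A2', 'B1']})
# (already sorted by key) NaN out vals on duplicated keys, then fill down
masked = v['val'].mask(v.duplicated('key'))
print(masked.ffill().tolist())  # ['A1', 'A1', 'B1']
```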
Answer 2 (score: 3)
Using drop_duplicates, create a Series that you then pass to map:
m = df.drop_duplicates('Account_Number', keep='first')\
      .set_index('Account_Number')\
      .Dummy_Account

df.Account_Number = df.Account_Number.map(m)
df
Account_Number Dummy_Account
0 ACC0000000000001 ACC0000000000001
1 ACC0000000000002 ACC0000000000002
2 ACC0000000000001 ACC0000000169532
3 ACC0000000123005 ACC0000000123005
4 ACC0000000000004 ACC0000000000004
5 ACC0000000000005 ACC0000000000005
6 ACC0000000123005 ACC0000000000040
Timings
df = pd.concat([df] * 1000000, ignore_index=True)
# jezrael's solution
%%timeit
v = df.sort_values('Account_Number')
v['Account_Number'] = v['Dummy_Account'].mask(v.duplicated('Account_Number')).ffill()
v.sort_index()
315 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# in this post
%%timeit
m = df.drop_duplicates('Account_Number', keep='first')\
      .set_index('Account_Number')\
      .Dummy_Account
df.Account_Number.map(m)
163 ms ± 3.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Note that performance depends on your actual data.