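I am working with a dataset of about 400k rows of preprocessed strings.
[In]:
raw                               preprocessed
helpersstreet 46, second floor    helpersstreet 46
489 john doe route                john doe route
at main street 49                 main street
Every string in the "preprocessed" column is the same as, or shorter than, the corresponding string in the "raw" column. Is there a fast way to compare these strings and return all the differences in a single column?
[Out]:
raw                               preprocessed        difference
helpersstreet 46, second floor    helpersstreet 46    ,second floor
489 john doe route                john doe route      489
at main street 49                 main street         at 49
I'm not sure how to go about this, and I'd also like to know whether it is feasible at all. I have access to the functions that do the preprocessing, so the solution could be either to modify them to return these values directly, or a scalable way to compute the difference afterwards. I'd prefer the latter.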
Answer 0 (score: 4)
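Option 1
This looks like an iterative, row-by-row replacement: each "preprocessed" string is removed from its "raw" string. You can use a list comprehension:
df['difference'] = [i.replace(j, '') for i, j in zip(df.raw, df.preprocessed)]
Given the constraints of this problem (string replacement is hard to vectorize), I think this is your fastest option.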
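The pairwise `str.replace` approach leaves behind whatever whitespace surrounded the removed text (for example a trailing space after `489`). The following is a minimal sketch, not part of the original answer, that wraps the replacement in a hypothetical helper named `difference` and collapses leftover whitespace. It uses plain lists so it runs without pandas; the sample strings are the ones from the question.

```python
def difference(raw, preprocessed):
    """Return what is left of `raw` after removing `preprocessed` from it.

    Hypothetical helper for illustration; the name is not from the original answer.
    """
    # str.replace removes every occurrence of the substring; if the substring
    # is not found, the raw string is returned unchanged.
    leftover = raw.replace(preprocessed, '')
    # Collapse runs of whitespace and trim both ends.
    return ' '.join(leftover.split())


raw = ['helpersstreet 46, second floor', '489 john doe route', 'at main street 49']
preprocessed = ['helpersstreet 46', 'john doe route', 'main street']

differences = [difference(r, p) for r, p in zip(raw, preprocessed)]
print(differences)  # [', second floor', '489', 'at 49']
```

With a DataFrame, the same helper would presumably slot in as `df['difference'] = [difference(r, p) for r, p in zip(df.raw, df.preprocessed)]`. The extra `split`/`join` adds some per-row cost, so it is worth timing on the real data before adopting it.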
Option 2
Alternatively, np.vectorize a lambda:
import numpy as np
f = np.vectorize(lambda i, j: i.replace(j, ''))
df['difference'] = f(df.raw, df.preprocessed)
df
   raw                             preprocessed      difference
0  helpersstreet 46, second floor  helpersstreet 46  , second floor
1  489 john doe route              john doe route    489
2  at main street 49               main street       at  49
Note that this only hides the loop; it is about as fast (or slow) as Option 1, if not worse.
Option 3
Use apply, which I don't recommend:
df['difference'] = df.apply(lambda x: x.raw.replace(x.preprocessed, ''), 1)
df
   raw                             preprocessed      difference
0  helpersstreet 46, second floor  helpersstreet 46  , second floor
1  489 john doe route              john doe route    489
2  at main street 49               main street       at  49
This one also hides the loop, but at the cost of more overhead than Option 2.
Timings
At the request of my friend Mr. jezrael:
df = pd.concat([df] * 10000, ignore_index=True)  # setup
# Option 1
%timeit df['difference'] = [i.replace(j, '') for i, j in zip(df.raw, df.preprocessed)]
186 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Option 2
%timeit df['difference'] = f(df.raw, df.preprocessed)
326 ms ± 14.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Option 3
%timeit df['difference'] = df.apply(lambda x: x.raw.replace(x.preprocessed, ''), 1)
20.8 s ± 237 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
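All three options rely on `str.replace`, which only removes the preprocessed text when it appears as one contiguous substring of the raw string; otherwise the raw string comes back unchanged. If the preprocessing can drop words from several places at once, a token-level diff may be closer to what is wanted. The sketch below is an assumption-laden alternative, not part of the original answer: the helper name `token_difference` is hypothetical, it splits on whitespace only, and it uses the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def token_difference(raw, preprocessed):
    """Return the whitespace-separated tokens of `raw` missing from `preprocessed`.

    Hypothetical helper for illustration; tokens are compared in order.
    """
    raw_tokens = raw.split()
    pre_tokens = preprocessed.split()
    matcher = SequenceMatcher(a=raw_tokens, b=pre_tokens, autojunk=False)
    removed = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'delete' and 'replace' mark raw tokens with no counterpart
        # at that position in the preprocessed string.
        if tag in ('delete', 'replace'):
            removed.extend(raw_tokens[i1:i2])
    return ' '.join(removed)


# Contiguous case from the question.
print(token_difference('at main street 49', 'main street'))

# Non-contiguous case (made-up example): str.replace would leave this raw string unchanged.
print(token_difference('12 main street apt 4', 'main street 4'))
```

Because tokens are split on whitespace, punctuation stays attached to its word: `'46,'` does not match `'46'`, so the first row of the question would yield `46, second floor` rather than `, second floor`. This is also likely to be noticeably slower per row than a plain `str.replace`, so it only makes sense where the contiguous-substring assumption does not hold.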