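I would like to use the replace function in Python 3 in an efficient way. My code gets the job done, but it is far too slow because I am working with a large dataset. Whenever there is a trade-off, my priority is efficiency over elegance. Here is a toy example of what I would like to do: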
import pandas as pd
df = pd.DataFrame([[1,2],[3,4],[5,6]], columns = ['1st', '2nd'])
1st 2nd
0 1 2
1 3 4
2 5 6
idxDict= dict()
idxDict[1] = 'a'
idxDict[3] = 'b'
idxDict[5] = 'c'
for k, v in idxDict.items():
    df['1st'] = df['1st'].replace(k, v)
which gives
1st 2nd
0 a 2
1 b 4
2 c 6
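This is the result I want, but it takes far too long. What would be the fastest way?
Edit: this is a more focused and cleaner question than this one, for which the solution is similar.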
Answer 0 (score: 4)
Use map to perform the lookup:
In [46]:
df['1st'] = df['1st'].map(idxDict)
df
Out[46]:
1st 2nd
0 a 2
1 b 4
2 c 6
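If the column may already contain NaN values, you can pass na_action='ignore' so that they are propagated without being looked up. Note that values with no matching key in the dict still become NaN.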
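As a minimal sketch of what happens to values with no key: the value 7 and the names mapped and kept below are illustrative assumptions, not part of the question's data. A plain dict lookup through map turns the unmatched value into NaN, while a function that falls back via dict.get leaves it unchanged.

```python
import pandas as pd

idxDict = {1: 'a', 3: 'b', 5: 'c'}
s = pd.Series([1, 3, 5, 7])  # 7 is a made-up value with no key in idxDict

# Plain dict lookup: the unmatched value becomes NaN.
mapped = s.map(idxDict)
print(mapped.isna().tolist())

# Fallback lookup: values without a key are returned unchanged.
kept = s.map(lambda x: idxDict.get(x, x))
print(kept.tolist())
```

The fallback version calls a Python function per element, so it may well be slower than the dict lookup; it is shown here only to make the difference in behaviour visible.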
You can also use df['1st'].replace(idxDict), but to answer your question about efficiency:
Timings
In [69]:
%timeit df['1st'].replace(idxDict)
%timeit df['1st'].map(idxDict)
1000 loops, best of 3: 1.57 ms per loop
1000 loops, best of 3: 1.08 ms per loop
In [70]:
%%timeit
for k, v in idxDict.items():
    df['1st'] = df['1st'].replace(k, v)
100 loops, best of 3: 3.25 ms per loop
So using map here is about 3x faster than the loop.
On a larger dataset:
In [3]:
df = pd.concat([df]*10000, ignore_index=True)
df.shape
Out[3]:
(30000, 2)
In [4]:
%timeit df['1st'].replace(idxDict)
%timeit df['1st'].map(idxDict)
100 loops, best of 3: 18 ms per loop
100 loops, best of 3: 4.31 ms per loop
In [5]:
%%timeit
for k, v in idxDict.items():
    df['1st'] = df['1st'].replace(k, v)
100 loops, best of 3: 18.2 ms per loop
For the 30K-row df, map is ~4x faster, so it scales better than replace or looping.
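To rerun this comparison outside IPython, something along these lines should work. It is only a sketch: avg_ms is a hypothetical helper built on the standard timeit module, and the numbers it prints depend on your machine and pandas version, so they will not match the figures above.

```python
import timeit

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['1st', '2nd'])
df = pd.concat([df] * 10000, ignore_index=True)
idxDict = {1: 'a', 3: 'b', 5: 'c'}


def avg_ms(func, number=20):
    """Average wall-clock time of func() in milliseconds (hypothetical helper)."""
    return timeit.timeit(func, number=number) / number * 1000


t_replace = avg_ms(lambda: df['1st'].replace(idxDict))
t_map = avg_ms(lambda: df['1st'].map(idxDict))
print(f"replace: {t_replace:.2f} ms, map: {t_map:.2f} ms")

# Sanity check: both calls yield the same values when every value has a key.
same_values = df['1st'].replace(idxDict).tolist() == df['1st'].map(idxDict).tolist()
print(same_values)
```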
Answer 1 (score: 1)
While map is indeed faster, replace was updated in pandas 0.19.2 (details here) to improve its speed, which makes the difference significantly smaller:
In [1]:
import pandas as pd
df = pd.DataFrame([[1,2],[3,4],[5,6]], columns = ['1st', '2nd'])
df = pd.concat([df]*10000, ignore_index=True)
df.shape
Out [1]:
(30000, 2)
In [2]:
idxDict = {1:'a', 3:"b", 5:"c"}
%timeit df['1st'].replace(idxDict, inplace=True)
%timeit df['1st'].update(df['1st'].map(idxDict))
Out [2]:
100 loops, best of 3: 12.8 ms per loop
100 loops, best of 3: 7.95 ms per loop
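In addition, I modified EdChum's map code to include update, which, while slower, prevents values that are missing from an incomplete mapping from being changed to NaN.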
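A small sketch of that behaviour on a standalone Series. The incomplete mapping partial and the values 5 and 7 are made up for illustration, and the object dtype is an assumption chosen so that strings can be written into the Series without any dtype upcasting.

```python
import pandas as pd

partial = {1: 'a', 3: 'b'}                  # hypothetical incomplete mapping
s = pd.Series([1, 3, 5, 7], dtype=object)   # object dtype is an assumption

mapped = s.map(partial)  # 5 and 7 have no key, so they become NaN here
s.update(mapped)         # in place; NaN entries in `mapped` are skipped
print(s.tolist())
```

Since update only writes the non-NaN entries of its argument, the values without a key keep their original contents.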
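Answer 2 (score: 1)
If NaN propagation is not desired (you want to replace values but keep those that have no match in the dict), there are two more options: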
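import numpy as np
import pandas as pd

def numpy_series_replace(series: pd.Series, mapping: dict) -> pd.Series:
    """Replace values in a series according to a mapping."""
    # Write into a copy and compare against the original values,
    # so an already-replaced value is never replaced a second time.
    result = series.to_numpy(copy=True)
    for k, v in mapping.items():
        result[series.to_numpy() == k] = v
    return pd.Series(result, index=series.index)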
or
def apply_series_replace(series: pd.Series, mapping: dict) -> pd.Series:
    """Look up each value, falling back to the value itself if it has no key."""
    return series.apply(lambda y: mapping.get(y, y))
The numpy implementation is a bit hacky, but it is faster:
v = pd.Series(np.random.randint(0, 10, 1000000))
mapper = {0: 1, 3: 2}
%timeit numpy_series_replace(v, mapper)
60.1 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit apply_series_replace(v, mapper)
311 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
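One thing worth checking with both functions is what happens when a replacement value is itself a key. The sketch below uses a made-up mapping {0: 1, 1: 2} and restates the two functions so it runs on its own. Both compare against the original values, so a 0 becomes 1 and stays 1, whereas a loop of successive replace(k, v) calls, like the one in the question, chains the replacements.

```python
import pandas as pd


def numpy_series_replace(series: pd.Series, mapping: dict) -> pd.Series:
    """Replace values via boolean masks computed on the original values."""
    result = series.to_numpy(copy=True)
    for k, v in mapping.items():
        result[series.to_numpy() == k] = v
    return pd.Series(result, index=series.index)


def apply_series_replace(series: pd.Series, mapping: dict) -> pd.Series:
    """Replace values with a per-element dict lookup and fallback."""
    return series.apply(lambda y: mapping.get(y, y))


s = pd.Series([0, 1, 2])
overlapping = {0: 1, 1: 2}  # hypothetical mapping whose values are also keys

via_numpy = numpy_series_replace(s, overlapping).tolist()
via_apply = apply_series_replace(s, overlapping).tolist()

# Successive scalar replaces chain: 0 -> 1, then every 1 -> 2.
chained = s.copy()
for k, v in overlapping.items():
    chained = chained.replace(k, v)

print(via_numpy, via_apply, chained.tolist())
```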