Replacing values in a DataFrame in Python

Time: 2016-12-30 22:10:46

Tags: python pandas

Lon_X        Lat_Y
5,234234     6,3234234
5,234234     6,3234234
5,234234     6,3234234
5,234234     6,3234234
5,234234     6,3234234

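I have GPS coordinates in the pandas DataFrame above. However, they use a comma as the decimal separator. What is the best way to convert them to float GPS coordinates with pandas?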

for item in frame.Lon_X:
    float(item.replace(",", ".")) # makes the conversion but does not store it back

I have tried the iteritems function, but it seems slow and gives me a warning that I don't quite understand:

for index, value in frame.Lon_X.iteritems():
    frame.Lon_X[index] = float(value.replace(",", "."))
  

See the caveats in the documentation:   http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy   from ipykernel import kernelapp as app

5 answers:

Answer 0 (score: 1)

You can apply pandas' vectorized string methods in place along an axis:

def to_float_inplace(x):
    x[:] = x.str.replace(',', '.').astype(float)

df.apply(to_float_inplace)
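
As a minimal sketch of the same vectorized idea, the converted columns can also be assigned back explicitly instead of being mutated inside the function. This may be more predictable, since whether in-place changes made inside apply reach the original frame can depend on the pandas version. The sample frame and the helper name to_float are assumptions for illustration; regex=False is used because the comma is a literal character here.

```python
import pandas as pd

# Small sample frame with comma decimal separators (illustrative data)
df = pd.DataFrame({
    "Lon_X": ["5,234234", "5,234234"],
    "Lat_Y": ["6,3234234", "6,3234234"],
})

def to_float(col):
    # Vectorized literal replace on the whole column, then cast to float
    return col.str.replace(",", ".", regex=False).astype(float)

# Convert every column and keep the result as a new frame
converted = df.apply(to_float)

# Or convert a single column and store it back with plain column assignment,
# which avoids the chained-assignment pattern from the question
frame = df.copy()
frame["Lon_X"] = to_float(frame["Lon_X"])

print(converted.dtypes)
```

Assigning a whole column at once also sidesteps the row-by-row loop from the question.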

Answer 1 (score: 0)

Try this:

df.applymap(lambda x: float(x.replace(",", ".")))

Edit: I had forgotten the map, as @Psidom shows.

Answer 2 (score: 0)

You can use applymap:

df[["Lon_X", "Lat_Y"]] = df[["Lon_X", "Lat_Y"]].applymap(lambda x: float(x.replace(",", ".")))
df


Here are some benchmarks of these alternatives; to_float_inplace is clearly faster than all the other methods:

Data

df = pd.DataFrame({"Lon_X": ["5,234234" for i in range(1000000)], "Lat_Y": ["6,3234234" for i in range(1000000)]})
# to_float_inplace
def to_float_inplace(x):
    x[:] = x.str.replace(',', '.').astype(float)

%timeit df.apply(to_float_inplace)
# 1 loops, best of 3: 269 ms per loop

# applymap + astype
%timeit df.applymap(lambda x: x.replace(",", ".")).astype(float)
# 1 loops, best of 3: 1.26 s per loop

# to_float
def to_float(x):
    return x.str.replace(',', '.').astype(float)

%timeit df.apply(to_float)
# 1 loops, best of 3: 1.47 s per loop

# applymap + float
%timeit df.applymap(lambda x: float(x.replace(",", ".")))
# 1 loops, best of 3: 1.75 s per loop

# replace with regex
%timeit df.replace(',', '.', regex=True).astype(float)
# 1 loops, best of 3: 1.79 s per loop
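
The %timeit lines above only run in IPython. A rough sketch of a comparable measurement in a plain script, using the standard timeit module, might look like the following. The function names, the reduced row count, and the use of Series.map for the element-wise variant are assumptions made for illustration; absolute timings will differ by machine and pandas version.

```python
import timeit

import pandas as pd

n = 10_000  # kept small so the sketch runs quickly; the benchmark above used 1,000,000
df = pd.DataFrame({"Lon_X": ["5,234234"] * n, "Lat_Y": ["6,3234234"] * n})

def vectorized(frame):
    # Column-wise string replace followed by a cast
    return frame.apply(lambda col: col.str.replace(",", ".", regex=False).astype(float))

def elementwise(frame):
    # One Python-level call per cell
    return frame.apply(lambda col: col.map(lambda s: float(s.replace(",", "."))))

def regex_replace(frame):
    # DataFrame.replace with a regular-expression pattern
    return frame.replace(",", ".", regex=True).astype(float)

candidates = {
    "vectorized": vectorized,
    "elementwise": elementwise,
    "regex_replace": regex_replace,
}

# Best of 3 repeats, 3 calls each, mirroring the "best of 3" reporting above
timings = {
    name: min(timeit.repeat(lambda func=func: func(df), number=3, repeat=3))
    for name, func in candidates.items()
}

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.4f} s")
```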

Answer 3 (score: 0)

You can skip apply and use the DataFrame's replace method directly with regex=True:

df.replace(',', '.', regex=True).astype(float)
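
To illustrate why regex=True matters here, the following sketch (with a one-row sample frame assumed for illustration) compares the two modes. Without regex=True, DataFrame.replace only replaces cells whose entire value equals the search string, so the commas inside the numbers are left alone.

```python
import pandas as pd

df = pd.DataFrame({"Lon_X": ["5,234234"], "Lat_Y": ["6,3234234"]})

# Without regex=True only cells that are exactly "," would be replaced,
# so this frame comes back unchanged
unchanged = df.replace(",", ".")
print(unchanged.equals(df))

# With regex=True the pattern is matched inside each string,
# after which the cast to float succeeds
converted = df.replace(",", ".", regex=True).astype(float)
print(converted)
```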

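Answer 4 (score: 0)

Surprisingly, iterating over the NumPy array seems to be faster than using pd.Series.str.replace. I ran the following experiment on a Series with 2 million rows: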

import timeit

setup = '''
import pandas as pd
import numpy as np
a = pd.Series(list('aabc') * 500000)
b = a.values.astype(str)
'''

a = '''
a[:] = a.str.replace("b", "d")
'''
b = '''
b[:] = np.char.replace(b, "b", "d")
'''
c = '''
for i, x in enumerate(b):
    if "b" in x:
        b[i] = "d"
'''
a_speed = min(timeit.Timer(a, setup=setup).repeat(7, 5))
b_speed = min(timeit.Timer(b, setup=setup).repeat(7, 5))
c_speed = min(timeit.Timer(c, setup=setup).repeat(7, 5))

Results:

a_speed = 2.3304627019997497

b_speed = 6.832672896000076

c_speed = 1.9407824309996613
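
When reading these timings, it may help to check that the three variants actually compute the same thing. The sketch below (variable names and the shortened series are assumptions for illustration) confirms they agree on this single-character data. Note that the loop overwrites the whole element rather than the matching substring, so it would not be a drop-in replacement for strings longer than one character.

```python
import numpy as np
import pandas as pd

a = pd.Series(list("aabc") * 3)   # shortened version of the benchmark data
b = a.to_numpy().astype(str)      # fixed-width NumPy string array

# Variant a: pandas vectorized string replace
via_pandas = a.str.replace("b", "d", regex=False).to_numpy().astype(str)

# Variant b: NumPy element-wise string replace
via_numpy = np.char.replace(b, "b", "d")

# Variant c: explicit Python loop over a copy of the NumPy array
via_loop = b.copy()
for i, x in enumerate(via_loop):
    if "b" in x:
        via_loop[i] = "d"  # replaces the entire element, not just the substring

print((via_pandas == via_numpy).all(), (via_numpy == via_loop).all())
```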