Replacing out-of-bounds (complex) values in a pandas DataFrame

Time: 2014-02-13 21:46:06

Tags: python numpy pandas

(Python 2.7, pandas 0.13.0)

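Background: I read a large amount of data from CSV files and load it into a pandas DataFrame. Some of the data is complex (I convert it from strings on load). Some of the values are device errors, recognizable because they are far too large. I want to replace every value whose magnitude exceeds some threshold with np.nan. This is easy with a numpy array (if you use a "complex nan", as shown), but it has been challenging in pandas. I have documented the steps I tried below. The last attempt almost gets there, but any column in which a replacement occurs is converted to real.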

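At this point all I can think of is to pull the values out into a numpy array, modify them, and load them back into the DataFrame, but that seems rather inelegant.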
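For reference, the numpy round trip described above could look roughly like the following. This is a minimal sketch, assuming a recent pandas (`to_numpy` does not exist in 0.13.0). The frame, the `threshold` value, and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small frame resembling the question's data: complex columns plus a text column.
values = np.arange(6).reshape((3, 2)) + 1j * np.arange(6, 12).reshape((3, 2))
values[1, 0] = 5000 + 8000j  # simulated device error
df = pd.DataFrame(values, columns=["Z1", "Z2"])
df["other junk"] = ["a", "b", "c"]

cols = ["Z1", "Z2"]
threshold = 5000  # hypothetical cutoff

# Pull the complex block out as a writable copy, mask it in numpy,
# then write each column back so the complex dtype is kept.
block = df[cols].to_numpy(dtype=complex, copy=True)
block[np.abs(block) > threshold] = complex(np.nan, np.nan)
for i, name in enumerate(cols):
    df[name] = block[:, i]
```

Writing the columns back one at a time keeps the text column out of the numeric block, so it is never coerced.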

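Edit: The solution below works, but I wonder whether there is still a bug in the way pandas handles NaN in the code I wrote. It looks like the NaN that gets created is nan+0.j rather than nan+nanj. Matplotlib will plot the latter without trouble if you are doing something like plot(np.real(signal), np.imag(signal)), but it does not like the former, because it ends up plotting (NaN, 0) pairs. It looks like I need to replace the new nan+0j entries with nan+nanj entries, which restarts the problem recursively. :)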

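EDIT2: There does appear to be a visual difference between the two NaNs, but the new bug I found is unrelated to that difference. The difference probably does not matter. Part of what is stated above is incorrect.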
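The two NaN forms mentioned in the edits can be compared directly in numpy. This is only an illustrative sketch and the variable names are made up. It does not show which form pandas produces.

```python
import numpy as np

nan_real_only = complex(np.nan, 0.0)  # displayed as (nan+0j)
nan_both = complex(np.nan, np.nan)    # displayed as (nan+nanj)

# numpy treats a complex number as NaN if either part is NaN.
both_nan = bool(np.isnan(nan_real_only)) and bool(np.isnan(nan_both))

# The imaginary parts differ, which is what a plot of
# (real, imag) pairs would pick up.
imag_parts = (np.imag(nan_real_only), np.imag(nan_both))
```

Both values pass an `np.isnan` check, so a mask built with `isnan` does not distinguish between them.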

import numpy as np
import pandas as pd

# begin by making a fake data set that resembles the CSV structure
headers = ['Z1', 'Z2', 'Z3']
temp = np.arange(12).reshape((4,3)) + 1j*np.arange(12,24).reshape((4,3))
temp[0,1] = 5000 + 1j*5000
temp[1,1] = 5000 + 1j*8000
temp[2,2] = 7000 + 1j*3000
junk = ['exists to', 'make life', 'extra', 'difficult']
df_junk = pd.DataFrame(data=junk, columns=['other junk'])
df = pd.DataFrame(data=temp, columns=headers)
df = pd.concat((df, df_junk), axis=1)
# very simple to do this in an np.array if we only take the numbers
temp2 = np.copy(temp)
# temp2 is the desired result, but in the frame with everything else
temp2[ np.abs(temp2) > 5000 ] = np.nan + 1j*np.nan
df2 = df.copy()

# Executing the next line replaces the value with NaN,
# but turns all of column Z2 into real numbers
#df[np.abs(df[headers]) > 5000 ] = np.nan + 1j*np.nan
# Trying to grab the index first gives
# ValueError: Cannot index with multidimensional key
#df.ix[np.abs(df[headers]) > 5000 ]
for column in headers:
    # The following line would turn the entire 3rd row into NaN
    # df[np.abs(df[column]) > 5000] = np.nan + 1j*np.nan
    # Attempts along these lines to apply a lambda (tried different ones)
    # didn't seem to work
    #csv_data[column] = csv_data[column].apply(lambda x:\
    # pd.replace(x, np.nan) if abs(x) > 5000 else pd.replace(x,x))
    # This last one almost works, but again turns columns with replacements into reals
    df2[column].where(abs(df2[column]) <= 5000, np.nan+1j*np.nan, inplace=True)

        Z1  Z2  Z3 other junk
0      12j NaN   2  exists to
1  (3+15j) NaN   5  make life
2  (6+18j)   7 NaN      extra
3  (9+21j)  10  11  difficult

1 Answer:

Answer 0: (score: 1)

It looks like it works without the inplace flag:

In [11]: df3 = df2[['Z1', 'Z2', 'Z3']]

In [12]: df3.where(df3 <= 5000)  # replaces by NaN by default
Out[12]:
        Z1        Z2        Z3
0      12j       NaN   (2+14j)
1  (3+15j)       NaN   (5+17j)
2  (6+18j)   (7+19j)       NaN
3  (9+21j)  (10+22j)  (11+23j)

In [13]: df2[['Z1', 'Z2', 'Z3']] = df3.where(df3 <= 5000)

In general, I think it is a good idea to avoid the inplace flag (although this may be a bug):

In [21]: df3.where(df3 <= 5000, inplace=True)

In [22]: df3
Out[22]:
        Z1  Z2  Z3
0      12j NaN   2
1  (3+15j) NaN   5
2  (6+18j)   7 NaN
3  (9+21j)  10  11
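Putting the question and the answer together, a non-inplace version might look like the sketch below. It assumes a recent pandas rather than 0.13.0. It also tests the magnitude with `abs()`, as the question does, instead of the `df3 <= 5000` comparison from the transcript above, and it passes a complex NaN as the replacement value. The names `nums`, `cleaned`, and `threshold` are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild the question's fake data set.
headers = ["Z1", "Z2", "Z3"]
temp = np.arange(12).reshape((4, 3)) + 1j * np.arange(12, 24).reshape((4, 3))
temp[0, 1] = 5000 + 5000j
temp[1, 1] = 5000 + 8000j
temp[2, 2] = 7000 + 3000j
df = pd.DataFrame(temp, columns=headers)
df["other junk"] = ["exists to", "make life", "extra", "difficult"]

threshold = 5000  # hypothetical cutoff

# Work only on the numeric columns and keep values whose magnitude is
# within the threshold. where() returns a new frame (no inplace flag).
nums = df[headers]
cleaned = nums.where(nums.abs() <= threshold, complex(np.nan, np.nan))

# Assign the result back into the original frame.
df[headers] = cleaned
```

Here the three oversized entries are replaced while the text column is left alone. Whether the stored NaN has a NaN imaginary part may depend on the pandas version, so check it if that matters for plotting.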