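(Python 2.7, pandas 0.13.0)
Background: I read a large amount of data from CSV files and load it into a pandas DataFrame. Some of the data is complex (I convert it from strings on load). Some values are equipment errors, recognizable because they are far too large. I want to replace every value whose magnitude exceeds some threshold with np.nan. This is easy with a numpy array (if you use a "complex nan", as shown), but it has been challenging in pandas. I have documented the steps I tried below. The last attempt almost gets there, but any column in which a replacement occurs is converted to real.
At this point all I can think of is pulling the values out into a numpy array, modifying them, and loading them back into the DataFrame, but that seems rather inelegant.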
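For reference, the numpy round trip described above could look roughly like the sketch below. It is only an illustration: the names block, cleaned and result are made up here, the frame is a cut-down version of the fake data set further down, and .values is used because it is available in both old and recent pandas releases.

```python
import numpy as np
import pandas as pd

headers = ['Z1', 'Z2', 'Z3']
values = np.arange(12).reshape((4, 3)) + 1j * np.arange(12, 24).reshape((4, 3))
values[0, 1] = 5000 + 5000j  # one obviously bad reading
df = pd.DataFrame(values, columns=headers)
df['other junk'] = ['exists to', 'make life', 'extra', 'difficult']

# Pull only the numeric columns out as a complex ndarray (astype makes a copy)
block = df[headers].values.astype(complex)

# Replace every entry whose magnitude exceeds the threshold with nan+nanj
block[np.abs(block) > 5000] = complex(np.nan, np.nan)

# Rebuild the numeric part and glue the non-numeric columns back on
cleaned = pd.DataFrame(block, columns=headers, index=df.index)
result = pd.concat([cleaned, df.drop(headers, axis=1)], axis=1)
print(result)
```

The original df is left untouched, so the two frames can be compared afterwards.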
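Edit: The solution below works, but I wonder whether there is still a bug in the way pandas handles NaN. In the code I wrote, the replaced entries come out as nan+0.j rather than nan+nanj. The latter is fine for operations like plot(np.real(signal), np.imag(signal)), but the former is not, because it ends up plotting (NaN, 0) pairs. It looks like I need to replace the new nan+0j entries with nan+nanj entries, which restarts the problem recursively. :)
EDIT2: There does appear to be a visual difference between the two NaNs, but the new error I found is unrelated to it. The difference probably does not matter. Some of what I wrote above is incorrect.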
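To make the two kinds of NaN concrete, here is a small numpy-only sketch. The names half_nan, full_nan and signal are hypothetical and chosen for this illustration. It shows that np.isnan flags both forms, that the imaginary part of a nan+0j entry is a finite 0.0, and one possible way to normalize everything to nan+nanj.

```python
import numpy as np

half_nan = complex(np.nan, 0.0)     # prints as (nan+0j)
full_nan = complex(np.nan, np.nan)  # prints as (nan+nanj)
signal = np.array([1 + 2j, half_nan, full_nan])

# np.isnan flags an entry if either its real or its imaginary part is NaN
mask = np.isnan(signal)

# The imaginary part of the nan+0j entry is a finite 0.0, hence (NaN, 0) pairs
imag_before = np.imag(signal).copy()

# Normalize every NaN-containing entry to nan+nanj
signal[mask] = complex(np.nan, np.nan)
imag_after = np.imag(signal)

print(mask)
print(imag_before)
print(imag_after)
```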
import numpy as np
import pandas as pd

# begin by making a fake data set that resembles the CSV structure
headers = ['Z1', 'Z2', 'Z3']
temp = np.arange(12).reshape((4,3)) + 1j*np.arange(12,24).reshape((4,3))
temp[0,1] = 5000 + 1j*5000
temp[1,1] = 5000 + 1j*8000
temp[2,2] = 7000 + 1j*3000
junk = ['exists to', 'make life', 'extra', 'difficult']
df_junk = pd.DataFrame(data=junk, columns=['other junk'])
df = pd.DataFrame(data=temp, columns=headers)
df = pd.concat((df, df_junk), axis=1)
# very simple to do this in an np.array if we only take the numbers
temp2 = np.copy(temp)
# temp2 is the desired result, but in the frame with everything else
temp2[ np.abs(temp2) > 5000 ] = np.nan + 1j*np.nan
df2 = df.copy()
# Executing the next line replaces the value with NaN,
# but turns all of column Z2 into real numbers
#df[np.abs(df[headers]) > 5000 ] = np.nan + 1j*np.nan
# Trying to grab the index first gives
# ValueError: Cannot index with multidimensional key
#df.ix[np.abs(df[headers]) > 5000 ]
for column in headers:
    # The following line would turn the entire 3rd row into NaN
    # df[np.abs(df[column]) > 5000] = np.nan + 1j*np.nan
    # Attempts along these lines to apply a lambda (tried different ones)
    # didn't seem to work
    #csv_data[column] = csv_data[column].apply(lambda x:\
    #     pd.replace(x, np.nan) if abs(x) > 5000 else pd.replace(x,x))
    # This last one almost works, but again turns columns with replacements into reals
    df2[column].where(abs(df2[column]) <= 5000, np.nan+1j*np.nan, inplace=True)
         Z1   Z2   Z3 other junk
0       12j  NaN    2  exists to
1   (3+15j)  NaN    5  make life
2   (6+18j)    7  NaN      extra
3   (9+21j)   10   11  difficult
Answer 0 (score: 1)
It looks like this works without the inplace flag:
In [11]: df3 = df2[['Z1', 'Z2', 'Z3']]
In [12]: df3.where(df3 <= 5000) # replaces by NaN by default
Out[12]:
         Z1        Z2        Z3
0       12j       NaN   (2+14j)
1   (3+15j)       NaN   (5+17j)
2   (6+18j)   (7+19j)       NaN
3   (9+21j)  (10+22j)  (11+23j)
In [13]: df2[['Z1', 'Z2', 'Z3']] = df3.where(df3 <= 5000)
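As a self-contained sketch of the same idea, the snippet below rebuilds the fake frame and applies where() without inplace, assigning the result back. One deliberate difference from the session above, and an assumption on my part: the mask is built from the magnitude with np.abs, which is what the question asks for, rather than by comparing the complex values to 5000 directly. The names numeric and keep are made up here, and the behaviour has not been checked on pandas 0.13.0.

```python
import numpy as np
import pandas as pd

headers = ['Z1', 'Z2', 'Z3']
temp = np.arange(12).reshape((4, 3)) + 1j * np.arange(12, 24).reshape((4, 3))
temp[0, 1] = 5000 + 5000j
temp[1, 1] = 5000 + 8000j
temp[2, 2] = 7000 + 3000j
df2 = pd.DataFrame(temp, columns=headers)
df2['other junk'] = ['exists to', 'make life', 'extra', 'difficult']

numeric = df2[headers]

# Boolean frame: True where the magnitude is within the threshold
keep = np.abs(numeric) <= 5000

# where() returns a new frame; entries where keep is False become NaN
df2[headers] = numeric.where(keep)
print(df2)
```

If nan+nanj is wanted instead of the default NaN, a complex value such as complex(np.nan, np.nan) can be passed as the second argument to where().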
In general, I think it is a good idea to avoid the inplace flag (though this may be a bug):
In [21]: df3.where(df3 <= 5000, inplace=True)
In [22]: df3
Out[22]:
         Z1   Z2   Z3
0       12j  NaN    2
1   (3+15j)  NaN    5
2   (6+18j)    7  NaN
3   (9+21j)   10   11