Question

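(Python 2.7, pandas 0.13.0)

Background: I read a large amount of data from CSV files and load it into a pandas DataFrame. Some of the data is complex (I convert it from strings on load). Some values are equipment errors, recognizable because they are far too large. I would like to replace every value whose magnitude exceeds some threshold with np.nan. This is easy with a numpy array (if you use a "complex nan", as shown), but has been challenging in pandas. I've documented the steps I tried below. The last attempt almost gets there, but any column in which a replacement takes place is converted to real.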

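At this point the only thing I can think of is to pull the values out into a numpy array, modify them, and load them back into the DataFrame, but that seems rather inelegant.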
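For reference, a minimal sketch of that round trip. It assumes a recent pandas and NumPy rather than the 0.13.0 setup above, and the names `frame`, `cols` and `block` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV data: three complex columns plus a text column.
cols = ['Z1', 'Z2', 'Z3']
values = np.arange(12).reshape((4, 3)) + 1j * np.arange(12, 24).reshape((4, 3))
values[0, 1] = 5000 + 5000j  # simulated equipment error
frame = pd.DataFrame(values, columns=cols)
frame['other junk'] = ['exists to', 'make life', 'extra', 'difficult']

# Pull the numeric block out as a writable complex array (explicit copy),
# mask by magnitude, then write the block back into the same columns.
block = frame[cols].to_numpy(dtype=complex, copy=True)
block[np.abs(block) > 5000] = complex(np.nan, np.nan)
frame[cols] = block
```

The text column is never touched, and the masking itself happens entirely in NumPy, so pandas only sees a finished complex array.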

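Edit: The solution below works, but I wonder whether there is still a bug in the way pandas handles NaN in the code I wrote. It looks like the NaN that gets created is nan+0j rather than nan+nanj. Matplotlib will plot the latter without trouble if you are doing something like plot(np.real(signal), np.imag(signal)), but it does not like the former, because it ends up plotting a (NaN, 0) pair. It looks like I need to replace the new nan+0j entries with nan+nanj entries, which restarts the problem recursively. :)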
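A small sketch of the two NaN flavours being compared, using only NumPy. The variable names are made up for illustration.

```python
import numpy as np

nan_real_only = complex(np.nan, 0.0)   # displayed as (nan+0j)
nan_both = complex(np.nan, np.nan)     # displayed as (nan+nanj)

signal = np.array([1 + 2j, nan_real_only, nan_both])

# np.isnan flags a complex value if either its real or imaginary part is NaN,
# so both flavours count as missing.
both_flagged = np.isnan(signal)

# The imaginary parts differ: 0.0 for the first flavour, NaN for the second.
imag_parts = np.imag(signal)
```

So the two values look different when printed, but both are reported as NaN by np.isnan.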

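EDIT2: There does seem to be a visual difference between the two NaNs, but the new error I found had nothing to do with that difference. The difference probably does not matter. Some of what I wrote above is incorrect.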

import numpy as np
import pandas as pd

# begin by making a fake data set that resembles the CSV structure
headers = ['Z1', 'Z2', 'Z3']
temp = np.arange(12).reshape((4,3)) + 1j*np.arange(12,24).reshape((4,3))
temp[0,1] = 5000 + 1j*5000
temp[1,1] = 5000 + 1j*8000
temp[2,2] = 7000 + 1j*3000
junk = ['exists to', 'make life', 'extra', 'difficult']
df_junk = pd.DataFrame(data=junk, columns=['other junk'])
df = pd.DataFrame(data=temp, columns=headers)
df = pd.concat((df, df_junk), axis=1)

# very simple to do this in an np.array if we only take the numbers
temp2 = np.copy(temp)
# temp2 is the desired result, but in the frame with everything else
temp2[ np.abs(temp2) > 5000 ] = np.nan + 1j*np.nan

df2 = df.copy()

# Executing the next line replaces the value with NaN,
# but turns all of column Z2 into real numbers
#df[np.abs(df[headers]) > 5000 ] = np.nan + 1j*np.nan

# Trying to grab the index first gives
# ValueError: Cannot index with multidimensional key
#df.ix[np.abs(df[headers]) > 5000 ]

for column in headers:
    # The following line would turn the entire 3rd row into NaN
    # df[np.abs(df[column]) > 5000] = np.nan + 1j*np.nan

    # Attempts along these lines to apply a lambda (tried different ones)
    # didn't seem to work
    #csv_data[column] = csv_data[column].apply(lambda x:\
    #    pd.replace(x, np.nan) if abs(x) > 5000 else pd.replace(x,x))

    # This last one almost works, but again turns columns with replacements into reals
    df2[column].where(abs(df2[column]) <= 5000, np.nan+1j*np.nan, inplace=True)

Resulting df2:

         Z1  Z2  Z3 other junk
0       12j NaN   2  exists to
1   (3+15j) NaN   5  make life
2   (6+18j)   7 NaN      extra
3   (9+21j)  10  11  difficult

Answer 1

It looks like this works if you leave out the inplace flag:

In [11]: df3 = df2[['Z1', 'Z2', 'Z3']]

In [12]: df3.where(df3 <= 5000)  # replaces by NaN by default
Out[12]:
        Z1        Z2        Z3
0      12j       NaN   (2+14j)
1  (3+15j)       NaN   (5+17j)
2  (6+18j)   (7+19j)       NaN
3  (9+21j)  (10+22j)  (11+23j)

In [13]: df2[['Z1', 'Z2', 'Z3']] = df3.where(df3 <= 5000)
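The session above compares the complex values directly against 5000. A variant that masks on magnitude, as the question asks, might look like the sketch below. It assumes a recent pandas and NumPy (recent NumPy versions reject ordering comparisons such as `<=` on complex values), and `demo`, `numeric` and `cleaned` are illustrative names, not part of the original code. On the versions I would expect this to run on, the masked columns stay complex, but treat that as something to verify on your own setup.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the one in the question.
cols = ['Z1', 'Z2', 'Z3']
data = np.arange(12).reshape((4, 3)) + 1j * np.arange(12, 24).reshape((4, 3))
data[0, 1] = 5000 + 5000j
data[1, 1] = 5000 + 8000j
data[2, 2] = 7000 + 3000j
demo = pd.DataFrame(data, columns=cols)
demo['other junk'] = ['exists to', 'make life', 'extra', 'difficult']

# Keep values whose magnitude is within the threshold; everything else
# becomes NaN. No inplace flag: the result is assigned back explicitly.
numeric = demo[cols]
cleaned = numeric.where(numeric.abs() <= 5000)
demo[cols] = cleaned
```

Assigning the result back to the selected columns leaves the text column untouched, which mirrors the assignment in the session above.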

In general, I think it is a good idea to avoid the inplace flag (although this behaviour may well be a bug):

In [21]: df3.where(df3 <= 5000, inplace=True)

In [22]: df3
Out[22]:
        Z1  Z2  Z3
0      12j NaN   2
1  (3+15j) NaN   5
2  (6+18j)   7 NaN
3  (9+21j)  10  11

Replacing out-of-bounds (complex) values in a pandas DataFrame