How to replace outliers with the median in a Pandas DataFrame?

Asked: 2019-03-20 19:02:17

Tags: python pandas median outliers

Here is my DataFrame:

cars_num_df.head(10)

    mpg  cylinders  displacement  horsepower  weight  acceleration  age
0  18.0          8         307.0       130.0  3504.0          12.0   13
1  15.0          8         350.0       165.0  3693.0          11.5   13
2  18.0          8         318.0       150.0  3436.0          11.0   13
3  16.0          8         304.0       150.0  3433.0          12.0   13
4  17.0          8         302.0       140.0  3449.0          10.5   13
5  15.0          8         429.0       198.0  4341.0          10.0   13
6  14.0          8         454.0       220.0  4354.0           9.0   13
7  14.0          8         440.0       215.0  4312.0           8.5   13
8  14.0          8         455.0       225.0  4425.0          10.0   13
9  15.0          8         390.0       190.0  3850.0           8.5   13

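I then standardized the data using z-scores, and now I want to replace the outliers in each column with that column's median (rather than removing them).

This is what I tried: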

median = cars_numz_df.median()
std = cars_numz_df.std()
value = cars_numz_df

outliers = (value - median).abs() > 2*std

cars_numz_df[outliers] = cars_numz_df[outliers].abs()

cars_numz_df[outliers]


   mpg  cylinders  displacement  horsepower  weight  acceleration  age
0  NaN   1.498191           NaN         NaN     NaN           NaN  NaN
1  NaN   1.498191           NaN         NaN     NaN           NaN  NaN
2  NaN   1.498191           NaN         NaN     NaN           NaN  NaN
3  NaN   1.498191           NaN         NaN     NaN           NaN  NaN
4  NaN   1.498191           NaN         NaN     NaN           NaN  NaN
5  NaN   1.498191      2.262118    2.454408     NaN           NaN  NaN
6  NaN   1.498191      2.502182    3.030708     NaN      2.384735  NaN
7  NaN   1.498191      2.367746    2.899730     NaN      2.566274  NaN
8  NaN   1.498191      2.511784    3.161685     NaN           NaN  NaN
9  NaN   1.498191      1.887617    2.244844     NaN      2.566274  NaN

Now I am trying to replace the outliers with the median like this:

cars_numz_df[outliers] = median

But I get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-394-d48a51500f28> in <module>
      9 cars_numz_df[outliers] = cars_numz_df[outliers].abs()
     10 
---> 11 cars_numz_df[outliers] = median
     12 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
   3112 
   3113         if isinstance(key, DataFrame) or getattr(key, 'ndim', None) == 2:
-> 3114             self._setitem_frame(key, value)
   3115         elif isinstance(key, (Series, np.ndarray, list, Index)):
   3116             self._setitem_array(key, value)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in _setitem_frame(self, key, value)
   3161         self._check_inplace_setting(value)
   3162         self._check_setitem_copy()
-> 3163         self._where(-key, value, inplace=True)
   3164 
   3165     def _ensure_valid_index(self, value):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in _where(self, cond, other, inplace, axis, level, errors, try_cast)
   7543 
   7544                 _, other = self.align(other, join='left', axis=axis,
-> 7545                                       level=level, fill_value=np.nan)
   7546 
   7547                 # if we are NOT aligned, raise as we cannot where index

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in align(self, other, join, axis, level, copy, fill_value, method, limit, fill_axis, broadcast_axis)
   3548                                             method=method, limit=limit,
   3549                                             fill_axis=fill_axis,
-> 3550                                             broadcast_axis=broadcast_axis)
   3551 
   3552     @Appender(_shared_docs['reindex'] % _shared_doc_kwargs)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in align(self, other, join, axis, level, copy, fill_value, method, limit, fill_axis, broadcast_axis)
   7370                                       copy=copy, fill_value=fill_value,
   7371                                       method=method, limit=limit,
-> 7372                                       fill_axis=fill_axis)
   7373         else:  # pragma: no cover
   7374             raise TypeError('unsupported type: %s' % type(other))

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in _align_series(self, other, join, axis, level, copy, fill_value, method, limit, fill_axis)
   7469                     fdata = fdata.reindex_indexer(join_index, lidx, axis=0)
   7470             else:
-> 7471                 raise ValueError('Must specify axis=0 or 1')
   7472 
   7473             if copy and fdata is self._data:

ValueError: Must specify axis=0 or 1

Please advise how to replace the outliers with the column medians.

1 Answer:

Answer 0: (score: 1)

I don't have access to the dataset from the question, so I constructed a random dataset instead.

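import random as r

import numpy as np
import pandas as pd

# Build a random single-column dataset to stand in for the real data
d = [r.random() * 1000 for i in range(0, 100)]
df = pd.DataFrame({'Values': d})

median = df['Values'].median()
std = df['Values'].std()
# Flag values that lie more than one standard deviation from the median
outliers = (df['Values'] - median).abs() > std
# Blank out the flagged values, then fill the gaps with the median
df.loc[outliers, 'Values'] = np.nan
df['Values'] = df['Values'].fillna(median)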
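
The snippet above handles a single column. For the multi-column case in the question, the traceback suggests that pandas cannot tell which axis to align the Series of medians along when it is assigned through a boolean DataFrame mask. One possible way around this, sketched here under the assumption that all columns are numeric floats, is `DataFrame.mask` with an explicit `axis=1`. The helper name `replace_outliers_with_median`, the `n_std` parameter, and the demo data are made up for illustration; the 2-standard-deviation threshold mirrors the question.

```python
import pandas as pd


def replace_outliers_with_median(frame, n_std=2.0):
    """Return a copy of `frame` with per-column outliers set to the column median.

    Hypothetical helper: a value counts as an outlier when it lies more than
    `n_std` standard deviations from its column's median.
    """
    median = frame.median()  # Series indexed by column name
    std = frame.std()        # Series indexed by column name
    # DataFrame-vs-Series arithmetic and comparison align on column labels
    outliers = (frame - median).abs() > n_std * std
    # axis=1 tells pandas to align the median Series with the columns
    return frame.mask(outliers, median, axis=1)


# Illustrative data, not the dataset from the question
demo = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 2.0, 100.0],
    'b': [10.0, 11.0, 9.0, 10.0, 10.0],
})
cleaned = replace_outliers_with_median(demo)
print(cleaned)
```

Because `mask` returns a new DataFrame, the original data stays untouched, which makes it easy to compare before and after.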

FWIW, clipping and winsorization are also worth considering when you are trying to turn outliers into something useful.
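
As a rough sketch of that idea, percentile-based clipping (one common way to winsorize) can be done with `DataFrame.quantile` and `DataFrame.clip`. The helper name `clip_to_quantiles`, the 5%/95% bounds, and the demo data are assumptions chosen for illustration, not values taken from the question.

```python
import pandas as pd


def clip_to_quantiles(frame, lower_q=0.05, upper_q=0.95):
    """Clip each column to its own [lower_q, upper_q] quantile range.

    Hypothetical helper: values outside the range are pulled in to the
    nearest bound rather than replaced by the median.
    """
    lower = frame.quantile(lower_q)  # Series indexed by column name
    upper = frame.quantile(upper_q)
    # axis=1 aligns the per-column bounds with the frame's columns
    return frame.clip(lower=lower, upper=upper, axis=1)


# Illustrative data: the values 1..19 plus one extreme value
demo = pd.DataFrame({'x': [float(v) for v in range(1, 20)] + [1000.0]})
clipped = clip_to_quantiles(demo)
print(clipped.describe())
```

Unlike median replacement, this keeps the ordering of the extreme values relative to the rest of the data, since they are moved to the boundary instead of the center.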