Question

I have a dataframe like the following:

 print(df.head(10))

 day         CO2
   1  549.500000
   2  663.541667
   3  830.416667
   4  799.695652
   5  813.850000
   6  769.583333
   7  681.941176
   8  653.333333
   9  845.666667
  10  436.086957

Then I use the following function and lines of code to get the outliers from the CO2 column:

def estimate_gaussian(dataset):

    mu = np.mean(dataset)  # mean
    sigma = np.std(dataset)  # standard deviation
    limit = sigma * 1.5

    min_threshold = mu - limit
    max_threshold = mu + limit

    return mu, sigma, min_threshold, max_threshold

mu, sigma, min_threshold, max_threshold = estimate_gaussian(df['CO2'].values)


dataset = df['CO2'].values  # same array that was passed to estimate_gaussian
condition1 = (dataset < min_threshold)
condition2 = (dataset > max_threshold)

outliers1 = np.extract(condition1, dataset)
outliers2 = np.extract(condition2, dataset)

outliers = np.concatenate((outliers1, outliers2), axis=0)

This gives me the following result:

print(outliers)

[830.41666667 799.69565217 813.85       769.58333333 845.66666667]

Now I would like to mark those outliers in red on a scatter plot.

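Below is the code I have used so far. It marks a single outlier in red on the scatter plot, but I cannot find a way to do this for every element of the outliers list, which is a numpy.ndarray: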

y = df['CO2']

x = df['day']

col = np.where(x<0,'k',np.where(y<845.66666667,'b','r'))

plt.scatter(x, y, c=col, s=5, linewidth=3)
plt.show()

This is what I get, but I want the same result for all of the outliers. Can you help me?

https://ibb.co/Ns9V7Zz

Answer 1

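Here is a quick solution:

I will recreate what you have already started. You only shared the head of the dataframe, so I simply inserted some random outliers. It looks like your estimate_gaussian() function can only return two outliers?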

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame([549.500000,
                50.0000000,
                830.416667,
                799.695652,
                1200.00000,
                769.583333,
                681.941176,
                1300.00000,
                845.666667,
                436.086957], 
                columns=['CO2'],
                index=list(range(1,11)))

def estimate_gaussian(dataset):

    mu = np.mean(dataset)  # mean
    sigma = np.std(dataset)  # standard deviation
    limit = sigma * 1.5

    min_threshold = mu - limit
    max_threshold = mu + limit

    return mu, sigma, min_threshold, max_threshold

mu, sigma, min_threshold, max_threshold = estimate_gaussian(df.values)

condition1 = (df < min_threshold)
condition2 = (df > max_threshold)

outliers1 = np.extract(condition1, df)
outliers2 = np.extract(condition2, df)

outliers = np.concatenate((outliers1, outliers2), axis=0)

Then we plot:

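# Select the rows whose CO2 value is one of the outliers; comparing
# df.values == outliers directly broadcasts to a 2-D array and cannot index df.
df_red = df[df['CO2'].isin(outliers)]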

plt.scatter(df.index,df.values)
plt.scatter(df_red.index,df_red.values,c='red')
plt.show()

Let me know if you need something more nuanced!

Answer 2

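This may not be the most efficient solution, but I find it easier to call plt.scatter multiple times, passing a single xy pair each time. Since we never create a new figure (for example with plt.figure()), every xy pair is drawn on the same figure.

Then, on each iteration, we only need to check whether the y value is an outlier. If it is, we change the color keyword argument in the plt.scatter call.

Try it: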

mu, sigma, min_threshold, max_threshold = estimate_gaussian(df['CO2'].values)

xs = df['day']
ys = df['CO2']

for x, y in zip(xs, ys):
    color = 'blue'  # non-outlier color
    if not min_threshold <= y <= max_threshold:  # condition for being an outlier
        color = 'red'  # outlier color
    plt.scatter(x, y, color=color)
plt.show()
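
One detail of the condition above is worth checking if the CO2 column can contain missing values. The following is a small sketch, with hypothetical thresholds and helper names that are not part of the answer, comparing the chained form with an explicit two-sided test. The two agree on ordinary numbers, but they differ for NaN, because every comparison with NaN is false.

```python
import math


def is_outlier_chained(y, lo, hi):
    # Form used in the answer: anything not inside [lo, hi] counts as an outlier.
    return not lo <= y <= hi


def is_outlier_explicit(y, lo, hi):
    # Explicit form: only values strictly below lo or strictly above hi.
    return y < lo or y > hi


lo, hi = 500.0, 900.0  # hypothetical thresholds, for illustration only

for y in (436.09, 549.5, 900.0, math.nan):
    print(y, is_outlier_chained(y, lo, hi), is_outlier_explicit(y, lo, hi))
```

With the chained form a NaN would be drawn in the outlier color; with the explicit form it would be drawn in the regular color. Which one is preferable depends on how missing readings should be treated.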

Answer 3

You can create an additional boolean column that defines whether each point is an outlier (True) or not (False), and then use two scatter plots:

df["outlier"] = # your boolean np array goes in here
plt.scatter(df.loc[df["outlier"], "day"], df.loc[df["outlier"], "CO2"], color="k")
plt.scatter(df.loc[~df["outlier"], "day"], df.loc[~df["outlier"], "CO2"], color="r")
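
The placeholder for the boolean array is left open in the answer. One possible way to fill it, assuming the same mean ± 1.5 standard deviations rule as in the question, is sketched below. The dataframe contains only the ten rows shown in the question, so the flagged rows differ from what the asker's full data would produce; the names lo and hi are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Only the ten rows printed in the question, not the asker's full data.
df = pd.DataFrame({
    "day": range(1, 11),
    "CO2": [549.5, 663.541667, 830.416667, 799.695652, 813.85,
            769.583333, 681.941176, 653.333333, 845.666667, 436.086957],
})

mu = np.mean(df["CO2"].values)
sigma = np.std(df["CO2"].values)
lo, hi = mu - 1.5 * sigma, mu + 1.5 * sigma

# between() is inclusive on both ends by default, so ~between() flags
# values strictly outside [lo, hi].
df["outlier"] = ~df["CO2"].between(lo, hi)

print(df.loc[df["outlier"], ["day", "CO2"]])
```

Once the column exists, the two scatter calls from the answer can select the outlier and non-outlier rows with df["outlier"] and ~df["outlier"].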

Answer 4

I am not sure what the idea behind your col list is, but you can replace col with

col = ['red' if yy in list(outliers) else 'blue' for yy in y]
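
The membership test relies on exact floating-point equality. That is fine when outliers was extracted from the same array as y, as in the question, but it can silently miss points if the outlier values were re-typed from printed output or recomputed. A hedged sketch of the difference, using a few values from the question and np.isclose with its default tolerances as one possible alternative (the variable names are illustrative):

```python
import numpy as np

y = np.array([549.5, 663.541667, 830.416667, 799.695652, 813.85])
# Outliers re-typed from a printed array, hence rounded differently from y.
printed_outliers = np.array([830.41666667, 799.69565217, 813.85])

# Exact membership: only values that are bit-for-bit equal match.
exact = np.isin(y, printed_outliers)

# Tolerant membership: compare every y with every outlier, then reduce per y.
close = np.isclose(y[:, None], printed_outliers[None, :]).any(axis=1)

col_exact = np.where(exact, "red", "blue")
col_close = np.where(close, "red", "blue")

print(col_exact)
print(col_close)
```

Here the exact test colors only 813.85 red, while the tolerant test colors all three intended points red.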

Answer 5

There are several ways to do this. One is to build a sequence of colors based on your condition and pass it to the c argument.

df = pd.DataFrame({'CO2': {0: 549.5,
  1: 663.54166699999996,
  2: 830.41666699999996,
  3: 799.695652,
  4: 813.85000000000002,
  5: 769.58333300000004,
  6: 681.94117599999993,
  7: 653.33333300000004,
  8: 845.66666699999996,
  9: 436.08695700000004},
 'day': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10}})

In [11]: colors = ['r' if n<750 else 'b' for n in df['CO2']]

In [12]: colors
Out[12]: ['r', 'r', 'b', 'b', 'b', 'b', 'r', 'r', 'b', 'r']

In [13]: plt.scatter(df['day'],df['CO2'],c=colors)

Or use np.where to create the sequence:

In [14]: colors = np.where(df['CO2'] < 750, 'r', 'b')
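
The single cut-off of 750 above is only a stand-in condition. For the two-sided case in the question, the same np.where pattern could take a lower and an upper threshold. The helper threshold_colors and the threshold values below are hypothetical and chosen only for illustration; in practice they would come from estimate_gaussian.

```python
import numpy as np


def threshold_colors(values, lo, hi, inside="b", outside="r"):
    """Return one color per value: `outside` if the value is below lo or above hi."""
    values = np.asarray(values, dtype=float)
    return np.where((values < lo) | (values > hi), outside, inside)


co2 = [549.5, 663.541667, 830.416667, 799.695652, 813.85,
       769.583333, 681.941176, 653.333333, 845.666667, 436.086957]

# Hypothetical thresholds, not the ones computed in the question.
colors = threshold_colors(co2, lo=500.0, hi=840.0)
print(colors)
```

The resulting array can be passed to plt.scatter through the c argument in the same way as the colors list above.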

Marking outliers on a scatter plot
