Cutting off outliers in a histogram (Python)

Time: 2018-06-28 08:46:56

Tags: python scipy statistics histogram

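I would like to know whether there is a way to determine how long my x-axis should be. I have records with different outliers. I can cut them off with plt.xlim(), but is there a statistical method to compute meaningful x-axis limits? In the attached picture, the logical cut would be after 150 km driven. Computing the threshold for the cut would be perfect.

[Image: logical manual cut after 150 km]

The dataframe passed to the function is a standard pandas dataframe.

Code: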

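import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import genextreme as gev  # assumed origin of the name gev
from sklearn.mixture import GaussianMixture  # replacement for the removed sklearn GMM class

def yearly_distribution(dataframe):


    df_distr = dataframe  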

    h=sorted(df_distr['Distance'])
    l=len(h)    

    fig, ax =plt.subplots(figsize=(16,9))

    binwidth = np.arange(0,501,0.5)

    # density=True replaces the normed argument, which newer matplotlib versions no longer accept
    n, bins, patches = plt.hist(h, bins=binwidth, density=True, facecolor='#023d6b', alpha=0.5, histtype='bar')

    lnspc =np.arange(0,500.5,0.5)

    gevfit = gev.fit(h)  
    pdf_gev = gev.pdf(lnspc, *gevfit)  
    plt.plot(lnspc, pdf_gev, label="GEV")

    logfit = stats.lognorm.fit(h)  
    pdf_lognorm = stats.lognorm.pdf(lnspc, *logfit)  
    plt.plot(lnspc, pdf_lognorm, label="LogNormal")

    weibfit = stats.weibull_min.fit(h)  
    pdf_weib = stats.weibull_min.pdf(lnspc, *weibfit)  
    plt.plot(lnspc, pdf_weib, label="Weibull")

    burrfit = stats.burr.fit(h)  
    pdf_burr = stats.burr.pdf(lnspc, *burrfit)  
    plt.plot(lnspc, pdf_burr, label="Burr Distribution")

    genparetofit = stats.genpareto.fit(h)
    pdf_genpareto = stats.genpareto.pdf(lnspc, *genparetofit)
    plt.plot(lnspc, pdf_genpareto, label ="Generalized Pareto")

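    # scikit-learn expects a 2-D array of shape (n_samples, n_features)
    myarray = np.array(h).reshape(-1, 1)

    # GaussianMixture replaces the old GMM class; n_iter is now max_iter
    clf = GaussianMixture(n_components=8, max_iter=500, random_state=3)
    clf = clf.fit(myarray)
    # score_samples returns the log-density per sample (score returns only the mean)
    pdf_gmm = np.exp(clf.score_samples(lnspc.reshape(-1, 1)))
    plt.plot(lnspc, pdf_gmm, label = "GMM")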

    plt.xlim(0,500)
    plt.xlabel('Distance')
    plt.ylabel('Probability')
    plt.title('Histogram')
    plt.ylim(0,0.05)
    plt.legend()  # show the labels given to the fitted curves

1 answer:

Answer 0 (score: 0)

You should remove the outliers from the data before any plotting or fitting.

Edit: this is perhaps not the fastest way, but it works:

h=sorted(df_distr['Distance'])

out_threshold= 150.0
h=[i for i in h if i<out_threshold]
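
The fixed 150.0 above is picked by eye. Since the question asks for a statistical way to find the cut, here is a minimal sketch of two common heuristics: Tukey's upper fence (Q3 + 1.5 × IQR) and a high quantile. The helper name `upper_cutoff`, its parameters, and the synthetic `distance` series are assumptions for illustration and not part of the original post. Neither rule is guaranteed to land on the cut you would choose visually.

```python
import numpy as np
import pandas as pd


def upper_cutoff(values, method="iqr", k=1.5, q=0.99):
    """Return an upper threshold for `values` (hypothetical helper).

    method="iqr":      Q3 + k * IQR (Tukey's upper fence)
    method="quantile": the q-th quantile of the data
    """
    values = np.asarray(values, dtype=float)
    if method == "iqr":
        q1, q3 = np.percentile(values, [25, 75])
        return q3 + k * (q3 - q1)
    if method == "quantile":
        return np.quantile(values, q)
    raise ValueError(f"unknown method: {method!r}")


# Synthetic stand-in for df_distr['Distance']: mostly short trips, a few very long ones.
rng = np.random.default_rng(0)
distance = pd.Series(
    np.concatenate([rng.gamma(2.0, 15.0, 5000), rng.uniform(300, 500, 20)])
)

threshold = upper_cutoff(distance, method="iqr")
# Boolean mask on the Series: a vectorised alternative to the list comprehension.
h = distance[distance < threshold]
```

The resulting `threshold` can be used both to filter the data before fitting and as the upper bound in plt.xlim(0, threshold). For strongly skewed data the IQR fence can be too aggressive, in which case method="quantile" with a high q is a gentler choice.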