Fitting a histogram with a set of distributions

Time: 2017-04-06 17:06:01

Tags: python matplotlib web-scraping statistics

I have a program that scrapes Wikipedia pages and measures the length of the path from a random page to the Philosophy page. The program generates a list of these path lengths (from source page to Philosophy), which gets passed to another function that plots the frequency of each path length. My approach here is based on an answer from this SO post.

In this function, I'm fitting the data with a set of different distributions to see which one fits the data set best. For some reason, the fitted curves appear off center, shifted away from the actual histogram bars in the graph:

[image: histogram of path lengths with the fitted distribution curves overlaid]

It seems like the curves should be centered on the histogram bars. Here is the function for plotting the frequencies:

import matplotlib.pyplot as plt
import scipy.stats

def plot_lengths(lens):
    """Plot the distribution of path lengths."""
    freq = {}
    max_len = 0

    for length in lens:
        max_len = max(length,max_len)
        if length in freq:
            freq[length] += 1
        else:
            freq[length] = 1
    max_freq = max(freq.values())
    bins = range(0, max_len + 1, 2)
    plt.hist(lens,bins,histtype = 'bar',rwidth = 0.8)
    plt.xlabel('x')
    plt.ylabel('Path Lengths')
    plt.title('Distribution of path lengths')
    dist_names = ['gamma', 'beta', 'rayleigh', 'norm', 'pareto']

    for dist_name in dist_names:
        dist = getattr(scipy.stats, dist_name)
        param = dist.fit(lens)
        pdf_fitted = dist.pdf(bins, *param[:-2], loc=param[-2], scale=param[-1]) * len(lens)
        plt.plot(pdf_fitted, label=dist_name)
        plt.xlim(0,max_len)
        plt.ylim(0,max_freq)
    plt.legend(loc='upper right')
    plt.show()

What could be causing the distributions in the graph to be off center?

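1 answer:

Answer 0 (score: 1)

You forgot to set x when plotting the fit. Without it, plt.plot uses the array indices 0, 1, 2, ... as x-values, and because the bins advance in steps of 2, the curves are squeezed into half their true horizontal range. The fourth line in the second for loop should be: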

plt.plot(bins, pdf_fitted, label=dist_name)
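
For context, here is a minimal sketch of how the fix might look in a complete function. It is not the original poster's code: the helper name `fitted_curves`, the `bin_width` and `n_points` parameters, and the shortened default list of distributions are assumptions made for illustration. It also multiplies the density by the bin width, since a histogram bar holds roughly N * pdf(x) * bin_width counts; the original code scales by N only, so treat that extra factor as a suggested adjustment rather than part of the answer above.

```python
import numpy as np
import scipy.stats


def fitted_curves(lens, dist_names, bin_width=2, n_points=200):
    """Fit each named scipy.stats distribution and return (x, curves).

    curves maps a distribution name to expected counts per bin at each x.
    (Hypothetical helper, not part of the original post.)
    """
    lens = np.asarray(lens, dtype=float)
    # Evaluate on a dense grid so the curves look smooth.
    x = np.linspace(0, lens.max(), n_points)
    curves = {}
    for name in dist_names:
        dist = getattr(scipy.stats, name)
        params = dist.fit(lens)  # (shape params..., loc, scale)
        # pdf is a density; scale by N * bin_width to compare with bar heights.
        curves[name] = dist.pdf(x, *params) * len(lens) * bin_width
    return x, curves


def plot_lengths(lens, dist_names=("gamma", "norm"), bin_width=2):
    """Plot the histogram of path lengths with fitted curves on the same x-axis."""
    # Imported here so fitted_curves can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    bins = np.arange(0, max(lens) + bin_width, bin_width)
    plt.hist(lens, bins, histtype="bar", rwidth=0.8)
    x, curves = fitted_curves(lens, dist_names, bin_width)
    for name, y in curves.items():
        # Pass x explicitly; otherwise matplotlib plots against array indices.
        plt.plot(x, y, label=name)
    plt.xlabel("Path length")
    plt.ylabel("Frequency")
    plt.title("Distribution of path lengths")
    plt.legend(loc="upper right")
    plt.show()
```

Calling `plot_lengths(lens)` with the scraped list should then draw the curves over the bars they describe. Keeping the fitting in a separate function also makes it possible to inspect the fitted values without opening a plot window.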