Boxplot of a dictionary instead of a list?

Date: 2012-11-12 16:10:40

Tags: python matplotlib

Let's say I want to create a boxplot of a list that contains the numbers 1-5, each of them about a million times.

Such a list would have about 5,000,000 elements, but it can be represented as a dictionary that takes up almost no space:

s = {1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5:1000000}

The problem is that if I try to create a boxplot of that dict, I get an error:

Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
    ax.boxplot(s)
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/matplotlib/axes.py", line 5462, in boxplot
    if not hasattr(x[0], '__len__'):
KeyError: 0

Is there a neat way to plot the dictionary s without having to put all the elements into a list?


A comment suggested that I try

boxplot(n for n, count in s.iteritems() for _ in xrange(count))

but that resulted in

Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    boxplot(n for n, count in s.iteritems() for _ in xrange(count))
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/matplotlib/pyplot.py", line 2134, in boxplot
    ret = ax.boxplot(x, notch, sym, vert, whis, positions, widths, patch_artist, bootstrap)
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/matplotlib/axes.py", line 5462, in boxplot
    if not hasattr(x[0], '__len__'):
TypeError: 'generator' object has no attribute '__getitem__'

2 answers:

Answer 0 (score: 4)

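The whole point of describing data with a picture is to get a feel for the data as a whole, not to be extremely precise. So there is little harm in shrinking the data by generating one representative point for every 1000 actual data points: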

x = [val for val, num in s.items() for i in range(num//1000)]

To the naked eye, the result should be good enough:

import matplotlib.pyplot as plt
import numpy as np
s = {1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5:1000000}
x = [val for val, num in s.items() for i in range(num//1000)]
dct = plt.boxplot(x)
plt.show()
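
As a rough sanity check on the downsampling idea, the quartiles of the reduced sample can be compared with the quartiles computed directly from the counts. This is only a sketch, assuming Python 3 and NumPy; the names `values`, `counts`, `x_small`, `exact` and `approx` are made up for the illustration, and `np.repeat` is used here in place of the list comprehension above.

```python
import numpy as np

s = {1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000}

# Distinct values in ascending order, with their counts in the same order.
values = np.array(sorted(s))
counts = np.array([s[v] for v in values])

# Downsampled sample: one representative point per 1000 real points.
x_small = np.repeat(values, counts // 1000)

# Quartiles of the full data, read off the cumulative relative frequencies:
# the first value whose cumulative share reaches 25%, 50% and 75%.
cum = np.cumsum(counts) / counts.sum()
exact = values[np.searchsorted(cum, [0.25, 0.5, 0.75])]

# Quartiles of the small sample, as a boxplot of x_small would show them.
approx = np.percentile(x_small, [25, 50, 75])

print(exact, approx)
```

For this particular dictionary the two sets of quartiles coincide. With counts that are not multiples of 1000, the integer division drops the remainder, so small deviations are to be expected.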

Answer 1 (score: 2)

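As far as I know, matplotlib has no method for this kind of data. Basically, you have to compute the relevant statistics yourself and implement your own way of drawing the box plot. This might get you started: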

import matplotlib.pyplot as plt
import numpy as np


s = [{1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5:1000000},
     {1: 1000000, 0: 1000000, 8: 1000000, 3: 1000000, 7:1000000}]

def boxplot(data, x=0):

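    # Sort the (value, count) pairs by value so each count stays with its value
    # (np.sort along axis 0 would sort the two columns independently).
    sorted_data = np.array(sorted(data.items()))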
    values = sorted_data[:,0]
    freqs = sorted_data[:,1]
    freqs = np.cumsum(freqs)
    freqs = freqs*1./np.max(freqs)

    #get 25%, 50%, 75% percentiles
    idx = np.searchsorted(freqs, [0.25, 0.5, 0.75])
    p25, p50, p75 = values[idx]
    vmin, vmax = values.min(), values.max()

    ax = plt.gca()
    l,r = -0.2+x, 0.2+x
    #plot boxes
    plt.plot([l,r], [p50, p50], 'k')
    plt.plot([l, r, r, l, l], [p25, p25, p75, p75, p25], 'k')
    plt.plot([x,x], [p75, vmax], 'k')
    plt.plot([x,x], [p25, vmin], 'k')

for i in range(len(s)):
    boxplot(s[i],i)
plt.xlim(-0.5,1.5)
plt.show()
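
Newer matplotlib releases (1.4 and later, as far as I can tell) also provide `Axes.bxp`, which draws box plots from precomputed statistics, so the drawing part may not need to be done by hand there. The following is only a sketch under that assumption, written for Python 3; `weighted_box_stats` is a hypothetical helper, not part of matplotlib, and the whiskers are placed at the minimum and maximum as in the code above.

```python
import numpy as np
import matplotlib.pyplot as plt

s = [{1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000},
     {1: 1000000, 0: 1000000, 8: 1000000, 3: 1000000, 7: 1000000}]


def weighted_box_stats(counts):
    """Hypothetical helper: box statistics for a {value: count} dict."""
    values = np.array(sorted(counts))
    freqs = np.array([counts[v] for v in values], dtype=float)
    # Cumulative relative frequency of each value.
    cum = np.cumsum(freqs) / freqs.sum()
    # First value whose cumulative share reaches 25%, 50% and 75%.
    q1, med, q3 = values[np.searchsorted(cum, [0.25, 0.5, 0.75])]
    return {
        "q1": float(q1),
        "med": float(med),
        "q3": float(q3),
        "whislo": float(values.min()),  # whiskers at min/max, no outliers
        "whishi": float(values.max()),
        "fliers": [],
    }


stats = [weighted_box_stats(d) for d in s]

fig, ax = plt.subplots()
# One box per dict, at x positions 0 and 1.
ax.bxp(stats, positions=range(len(stats)), showfliers=False)
ax.set_xlim(-0.5, len(stats) - 0.5)
```

Call `plt.show()` or `fig.savefig(...)` afterwards to see the figure. Only the distinct values and their counts are ever held in memory, so the five million points are never expanded.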