How do I correctly make a box plot from a pandas df?

Date: 2017-03-22 03:42:38

Tags: python pandas

I have a pandas df which, after pivoting, prints as follows:

country   CHINA    USA
0        119.02    0.0
1        121.20    0.0
3        112.49    0.0
4        113.94    0.0
5        114.67    0.0
6        111.77    0.0
7        117.57    0.0
......................

......................
6648       0.00  420.0
6649       0.00  420.0
6650       0.00  420.0
6651       0.00  420.0
6652       0.00  420.0
6653       0.00  420.0
6654       0.00  500.0
6655       0.00  500.0
6656       0.00  390.0
6657       0.00  450.0
6658       0.00  420.0
6659       0.00  420.0
6660       0.00  450.0 

Here is the method:

import pandas as pd
import matplotlib.pyplot as plt

def visualize_box_plot(df):

    df = df[df.outlier != 1]
    df = pd.pivot_table(df, 
                     index=df.index, 
                     columns = df['country'],
                     values='value', 
                     fill_value = 0)

    df.CHINA = df.CHINA.round(2)
    df.USA = df.USA.round(2)

    # this prints the output
    # shown earlier
    print(df)

    df_usa = df[(df['USA'] != 0)]
    df_china = df[(df['CHINA'] != 0)]

    # as_matrix() was removed in newer pandas; to_numpy() is the replacement
    usa = df_usa.to_numpy()[:, -1]
    china = df_china.to_numpy()[:, 0]

    print("USA:", len(usa), " ", "CHINA: ", len(china))

    # unequal length 
    # USA: 1673   CHINA:  4384

    x =  [china, usa]
    plt.boxplot(x)
    plt.show()
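
During pivoting, the zero values come from NaN (they are filled in by fill_value = 0), and I want to omit them when making the box plot. So I use this code: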

    df_usa = df[(df['USA'] != 0)]
    df_china = df[(df['CHINA'] != 0)]

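This code creates separate DataFrames, which I convert to NumPy arrays and finally visualize with matplotlib. Note that the NumPy arrays have different lengths, so I cannot call the df's boxplot function directly.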

Here is my visualization, where the tick labels 1 and 2 need to be replaced with CHINA and USA respectively:

[image]

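The visualization is poor, and I feel there is probably a better way to get this done. Any suggestions? Some sample code would help a lot. You may round the df values to 2 decimal places. The main goal is to make the code more elegant and to improve the visualization.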
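For the unequal-length arrays and the 1 and 2 tick labels described in the question, here is a minimal sketch of one possible fix on the matplotlib side. The two arrays are hypothetical stand-ins for `china` and `usa`, and the `Agg` backend is chosen only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in an interactive session
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-in data with unequal lengths, like the china/usa arrays above.
china = np.array([119.02, 121.20, 112.49, 113.94, 114.67])
usa = np.array([420.0, 500.0, 390.0])

fig, ax = plt.subplots()
bp = ax.boxplot([china, usa])         # a list of 1-D arrays may have different lengths
ax.set_xticklabels(["CHINA", "USA"])  # replace the default 1, 2 tick labels
ax.set_ylabel("value")
fig.canvas.draw()                     # render; use plt.show() interactively
```

Passing a list of arrays avoids the need for equal lengths, and the tick labels are set after the boxes are drawn.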

2 Answers:

Answer 0: (score: 1)

I think the code can be simpler: just replace 0 with NaN using mask, and then call DataFrame.boxplot:

    print(df.mask(df == 0))
    # alternative solution
    # print(df.replace(0, np.nan))
              CHINA    USA
    country
    0        119.02    NaN
    1        121.20    NaN
    3        112.49    NaN
    4        113.94    NaN
    5        114.67    NaN
    6        111.77    NaN
    7        117.57    NaN
    6648        NaN  420.0
    6649        NaN  420.0
    6650        NaN  420.0
    6651        NaN  420.0
    6652        NaN  420.0
    6653        NaN  420.0
    6654        NaN  500.0
    6655        NaN  500.0
    6656        NaN  390.0
    6657        NaN  450.0
    6658        NaN  420.0
    6659        NaN  420.0
    6660        NaN  450.0

    df.mask(df == 0).boxplot()

[image: graph]

Another possible solution is to use DataFrame.plot.box:

    df.mask(df == 0).plot.box()

[image: graph]

See Box Plots in the docs.
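The mask approach in this answer can be tried end to end with a small frame. The following is only a sketch: the five-row DataFrame is a hypothetical miniature of the pivoted data, and the `Agg` backend is an assumption made so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in an interactive session
import pandas as pd

# Hypothetical miniature of the pivoted frame, with 0 as the fill value.
df = pd.DataFrame(
    {
        "CHINA": [119.02, 121.20, 112.49, 0.0, 0.0],
        "USA": [0.0, 0.0, 0.0, 420.0, 500.0],
    }
)

masked = df.mask(df == 0)  # every cell equal to 0 becomes NaN
ax = masked.boxplot()      # one box per column; NaN cells are left out of each box
ax.set_ylabel("value")
```

Because the boxes are drawn per column, the columns can hold different numbers of valid values, and the column names are used as tick labels.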

Answer 1: (score: 0)

Besides the numpy nan that jezrael mentioned, you can also use nan from math:

import matplotlib.pyplot as plt
import pandas as pd
import math

data = {'c1': [1, 2, 3], 'c2': [5, 3, 0]}
for k in data:  # search and replace zeroes with math.nan
    data[k] = [x if x != 0 else math.nan for x in data[k]]
df = pd.DataFrame(data, columns=list(data.keys()))
df.plot.box(grid=False)  # pass the boolean False, not the string 'False'
plt.show()
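Both answers turn the zeros back into NaN. Since the question says those zeros only appear because of fill_value = 0 in the pivot, a possible alternative is to skip the pivot and group the long-format frame directly. This is a sketch under assumptions: the data is hypothetical, and only the column names `country`, `value`, and `outlier` are taken from the question's code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in an interactive session
import pandas as pd

# Hypothetical long-format data using the question's column names.
long_df = pd.DataFrame(
    {
        "country": ["CHINA", "CHINA", "CHINA", "USA", "USA"],
        "value": [119.02, 121.20, 112.49, 420.0, 500.0],
        "outlier": [0, 0, 0, 0, 1],
    }
)

kept = long_df[long_df["outlier"] != 1]  # same outlier filter as in the question

# One box per country straight from the long format: no pivot, no zeros, no NaN.
kept.boxplot(column="value", by="country")
```

With this layout the group names from the `country` column label the boxes, so no manual tick relabeling should be needed.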
