Python:使用category和markersize绘制散点图

时间:2017-06-28 12:33:20

标签: python pandas matplotlib ggplot2 seaborn

我正在尝试在Python中从here中选取一个R ggplot2图。我正在查看相关散点图,如下所示。 enter image description here

导入数据

import pandas as pd
midwest= pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/midwest.csv") 

默认大熊猫散点图

midwest.plot(kind='scatter', x='area', y='poptotal', ylim=((0, 50000)), xlim=((0., 0.1)))

上面的代码本身不会对不同的类别进行颜色编码,看起来如下所示。

enter image description here

Pandas Groupby + Scatter plot

但是,我们可以按“状态”对数据帧进行分组,然后分别为每个组(ref)执行散点图。

fig, ax = plt.subplots()
groups = midwest.groupby('state')
for name, group in groups:
    ax.plot(group.area, group.poptotal, marker='o', linestyle='', ms=10, 
label=name)
ax.legend(numpoints=1)
ax.set_ylim((0, 500000))

enter image description here

虽然这确实在散点图中为我们提供了不同的类别,但它并没有按popdensity来确定它们的大小。

Seaborn对情节

 import seaborn as sns
sns.pairplot(x_vars=["area"], y_vars=["poptotal"], data=midwest, 
hue="state", size=5)
plt.gca().set_ylim((0, 50000))

enter image description here

同样,这只是按类别划分的散点图。但是,我们仍然没有popdensity

的标记大小

Matplotlib

以下是我们如何进入每个数据点并在Matplotlib中绘制图表。

fig, ax = plt.subplots()
groups = midwest.groupby('state')
min_popdensity, max_popdensity = midwest['popdensity'].min(), midwest['popdensity'].max()
for name, group in groups:
    for data_point in group.itertuples():
        ax.plot(data_point.area, data_point.poptotal, marker='o', linestyle='', 
                   ms=1+12*((max_popdensity-data_point.popdensity)/(max_popdensity-min_popdensity)), label=name)
ax.set_ylim((0, 500000))

这产生了与目标图几乎相似的图。 enter image description here

问题

  1. 我们如何根据点的popdensity得到标记,而不进行所有繁重的工作(比如单独绘制每个点)?
  2. 我们如何添加ggplot可视化中显示的平滑线。
  3. 其他信息

    以下是中西部数据框的head

    PID county  state   area    poptotal    popdensity  popwhite    popblack    popamerindian   popasian    ... percollege  percprof    poppovertyknown percpovertyknown    percbelowpoverty    percchildbelowpovert    percadultpoverty    percelderlypoverty  inmetro category
    0   561 ADAMS   IL  0.052   66090   1270.961540 63917   1702    98  249 ... 19.631392   4.355859    63628   96.274777   13.151443   18.011717   11.009776   12.443812   0   AAR
    1   562 ALEXANDER   IL  0.014   10626   759.000000  7054    3496    19  48  ... 11.243308   2.870315    10529   99.087145   32.244278   45.826514   27.385647   25.228976   0   LHR
    2   563 BOND    IL  0.022   14991   681.409091  14477   429 35  16  ... 17.033819   4.488572    14235   94.956974   12.068844   14.036061   10.852090   12.697410   0   AAR
    3   564 BOONE   IL  0.017   30806   1812.117650 29344   127 46  150 ... 17.278954   4.197800    30337   98.477569   7.209019    11.179536   5.536013    6.217047    1   ALU
    4   565 BROWN   IL  0.018   5836    324.222222  5264    547 14  5   ... 14.475999   3.367680    4815    82.505140   13.520249   13.022889   11.143211   19.200000   0   AAR
    

    并且,这是原始帖子中使用的ggplot2代码。

    options(scipen=999)  # turn-off scientific notation like 1e+48
    library(ggplot2)
    theme_set(theme_bw())  # pre-set the bw theme.
    data("midwest", package = "ggplot2")
    
    
    # Scatterplot
    gg <- ggplot(midwest, aes(x=area, y=poptotal)) + 
      geom_point(aes(col=state, size=popdensity)) + 
      geom_smooth(method="loess", se=F) + 
      xlim(c(0, 0.1)) + 
      ylim(c(0, 500000)) + 
      labs(subtitle="Area Vs Population", 
           y="Population", 
           x="Area", 
           title="Scatterplot", 
           caption = "Source: midwest")
    
    plot(gg)
    

    修改

    我不知道问题是否会重新开启(标记为重复)。与此同时,这里只有熊猫的答案才能合理地运作。

    fig, ax = plt.subplots()
    groups = midwest.groupby('state')
    colors = ['b','g','r','y','k']
    for i, (name, group) in enumerate(groups):
        group.plot(kind='scatter', x='area', y='poptotal', ylim=((0, 50000)), xlim=((0., 0.1)), s=10+group['popdensity']*0.01, label=name, ax=ax, color=colors[i])
    lgd = ax.legend(numpoints=1)
    for handle in lgd.legendHandles:
        handle.set_sizes([100.0])
    ax.set_ylim((0, 500000))
    

    enter image description here

    编辑2

    https://stackoverflow.com/users/3707607/ted-petrou评论中提到的以下答案似乎解决了使用seaborn的问题。

    sizes = [10, 40, 70, 100, 130] 
    marker_size = pd.cut(4*midwest['popdensity'], [0, 20000, 40000, 60000, 80000, 1000000], labels=sizes) 
    sns.lmplot('area', 'poptotal', data=midwest, hue='state', fit_reg=False, scatter_kws={'s':marker_size})
    plt.ylim((0, 500000))
    

    enter image description here

0 个答案:

没有答案