I am trying to reproduce an R ggplot2 plot from here in Python. I am looking at the correlation scatter plot, shown below.
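import pandas as pd
import matplotlib.pyplot as plt

# Load the midwest dataset and draw a plain scatter plot of population against area
midwest = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/midwest.csv")
midwest.plot(kind='scatter', x='area', y='poptotal', ylim=(0, 50000), xlim=(0., 0.1))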
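On its own, the code above does not color-code the different categories, and looks like the following.
However, we can group the dataframe by 'state' and then draw a scatter plot for each group separately (ref).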
fig, ax = plt.subplots()
groups = midwest.groupby('state')
# One plot call per state, so each state gets its own color and legend entry
for name, group in groups:
    ax.plot(group.area, group.poptotal, marker='o', linestyle='', ms=10,
            label=name)
ax.legend(numpoints=1)
ax.set_ylim((0, 500000))
While this does give us the different categories in the scatter plot, it does not size the markers by popdensity.
import seaborn as sns
# seaborn 0.9 renamed the `size` argument of pairplot to `height`
sns.pairplot(x_vars=["area"], y_vars=["poptotal"], data=midwest,
             hue="state", height=5)
plt.gca().set_ylim((0, 50000))
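Again, this is just a scatter plot by category; popdensity is still not reflected in the plot.
The following is how we can go through each data point and plot it individually in Matplotlib.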
fig, ax = plt.subplots()
groups = midwest.groupby('state')
min_popdensity, max_popdensity = midwest['popdensity'].min(), midwest['popdensity'].max()
for name, group in groups:
    for data_point in group.itertuples():
        # Marker size grows linearly with popdensity, from 1 (minimum) to 13 (maximum)
        size = 1 + 12 * ((data_point.popdensity - min_popdensity) / (max_popdensity - min_popdensity))
        ax.plot(data_point.area, data_point.poptotal, marker='o', linestyle='',
                ms=size, label=name)
ax.set_ylim((0, 500000))
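The marker-size expression in the loop is a min-max rescaling of popdensity onto a fixed range of marker sizes. A minimal sketch of that idea as a standalone helper is below. The function name `scale_marker_size` and its defaults are hypothetical, and it assumes that larger values should get larger markers. The sample values are the popdensity values of the first five rows of the midwest data.

```python
def scale_marker_size(value, vmin, vmax, smallest=1.0, largest=13.0):
    """Linearly map `value` from [vmin, vmax] onto [smallest, largest]."""
    if vmax == vmin:
        # Degenerate range: every point gets the same mid-sized marker.
        return (smallest + largest) / 2.0
    fraction = (value - vmin) / (vmax - vmin)
    return smallest + (largest - smallest) * fraction


# popdensity of the first five counties (ADAMS, ALEXANDER, BOND, BOONE, BROWN)
densities = [1270.961540, 759.000000, 681.409091, 1812.117650, 324.222222]
sizes = [scale_marker_size(d, min(densities), max(densities)) for d in densities]
print(sizes)
```

The least dense county maps to the smallest marker and the densest to the largest; the other three fall in between in proportion to their density.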
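Is there a way to get the markers sized by popdensity without doing all the heavy lifting (such as plotting each point individually)? Below is the head of the midwest dataframe.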
   PID     county  state   area  poptotal   popdensity  popwhite  popblack  popamerindian  popasian  ...  percollege  percprof  poppovertyknown  percpovertyknown  percbelowpoverty  percchildbelowpovert  percadultpoverty  percelderlypoverty  inmetro  category
0  561      ADAMS     IL  0.052     66090  1270.961540     63917      1702             98       249  ...   19.631392  4.355859            63628         96.274777         13.151443             18.011717         11.009776           12.443812        0       AAR
1  562  ALEXANDER     IL  0.014     10626   759.000000      7054      3496             19        48  ...   11.243308  2.870315            10529         99.087145         32.244278             45.826514         27.385647           25.228976        0       LHR
2  563       BOND     IL  0.022     14991   681.409091     14477       429             35        16  ...   17.033819  4.488572            14235         94.956974         12.068844             14.036061         10.852090           12.697410        0       AAR
3  564      BOONE     IL  0.017     30806  1812.117650     29344       127             46       150  ...   17.278954  4.197800            30337         98.477569          7.209019             11.179536          5.536013            6.217047        1       ALU
4  565      BROWN     IL  0.018      5836   324.222222      5264       547             14         5  ...   14.475999  3.367680             4815         82.505140         13.520249             13.022889         11.143211           19.200000        0       AAR
And this is the ggplot2 code used in the original post.
options(scipen=999) # turn-off scientific notation like 1e+48
library(ggplot2)
theme_set(theme_bw()) # pre-set the bw theme.
data("midwest", package = "ggplot2")
# Scatterplot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
  geom_point(aes(col=state, size=popdensity)) +
  geom_smooth(method="loess", se=F) +
  xlim(c(0, 0.1)) +
  ylim(c(0, 500000)) +
  labs(subtitle="Area Vs Population",
       y="Population",
       x="Area",
       title="Scatterplot",
       caption = "Source: midwest")
plot(gg)
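I don't know whether the question will be reopened (it was marked as a duplicate). In the meantime, here is a pandas-only answer that works reasonably well.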
fig, ax = plt.subplots()
groups = midwest.groupby('state')
colors = ['b', 'g', 'r', 'y', 'k']
for i, (name, group) in enumerate(groups):
    # Marker area scales with popdensity; one color per state
    group.plot(kind='scatter', x='area', y='poptotal', ylim=(0, 50000), xlim=(0., 0.1),
               s=10 + group['popdensity'] * 0.01, label=name, ax=ax, color=colors[i])
lgd = ax.legend(numpoints=1)
# Give every legend marker the same size (the attribute was named legendHandles before Matplotlib 3.7)
for handle in lgd.legend_handles:
    handle.set_sizes([100.0])
ax.set_ylim((0, 500000))
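The same idea also works with Matplotlib alone, since `Axes.scatter` accepts an array of marker areas through `s`, so each state needs only one call. The sketch below is not from the original answer. It uses a small synthetic stand-in for the midwest data (made-up values, same column names) so that it runs without a download, and the 10 to 300 size range is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the midwest data: 12 made-up counties per state.
rng = np.random.default_rng(0)
states = np.repeat(["IL", "IN", "MI", "OH", "WI"], 12)
demo = pd.DataFrame({
    "state": states,
    "area": rng.uniform(0.005, 0.1, size=len(states)),
    "poptotal": rng.integers(2_000, 500_000, size=len(states)),
})
demo["popdensity"] = demo["poptotal"] / demo["area"]  # assumption: density = population / area

# Scale against the global range so marker sizes are comparable across states.
lo, hi = demo["popdensity"].min(), demo["popdensity"].max()

fig, ax = plt.subplots()
for name, group in demo.groupby("state"):
    marker_area = 10 + 290 * (group["popdensity"] - lo) / (hi - lo)
    # One scatter call per state: automatic color per call, array of sizes via `s`.
    ax.scatter(group["area"], group["poptotal"], s=marker_area, alpha=0.7, label=name)

ax.set_xlim(0, 0.1)
ax.set_ylim(0, 500_000)
ax.set_xlabel("Area")
ax.set_ylabel("Population")
ax.legend(title="state")
# Call plt.show() or fig.savefig(...) to view the result.
```

With the real data, replacing `demo` by `midwest` should be enough, since only the `state`, `area`, `poptotal` and `popdensity` columns are used.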
The following answer, mentioned in a comment by https://stackoverflow.com/users/3707607/ted-petrou, seems to solve the problem using seaborn.
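sizes = [10, 40, 70, 100, 130]
# Bin the scaled popdensity into five marker sizes
marker_size = pd.cut(4*midwest['popdensity'], [0, 20000, 40000, 60000, 80000, 1000000], labels=sizes)
# Recent seaborn releases require x and y to be passed as keyword arguments
sns.lmplot(x='area', y='poptotal', data=midwest, hue='state', fit_reg=False, scatter_kws={'s': marker_size})
plt.ylim((0, 500000))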
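As a possible alternative that was not part of the original answers, seaborn 0.9 and later provide `scatterplot`, which takes both `hue` and `size` column names directly and so comes close to the ggplot2 `aes(col=state, size=popdensity)` mapping. The sketch below assumes such a seaborn version is installed and again uses a synthetic stand-in for the midwest data (made-up values, same column names); the `sizes=(20, 400)` range is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the midwest data: 12 made-up counties per state.
rng = np.random.default_rng(1)
states = np.repeat(["IL", "IN", "MI", "OH", "WI"], 12)
demo = pd.DataFrame({
    "state": states,
    "area": rng.uniform(0.005, 0.1, size=len(states)),
    "poptotal": rng.integers(2_000, 500_000, size=len(states)),
})
demo["popdensity"] = demo["poptotal"] / demo["area"]  # assumption: density = population / area

fig, ax = plt.subplots()
# hue colors the points by state; size scales marker area with popdensity.
sns.scatterplot(data=demo, x="area", y="poptotal",
                hue="state", size="popdensity", sizes=(20, 400),
                alpha=0.7, ax=ax)
ax.set_xlim(0, 0.1)
ax.set_ylim(0, 500_000)
# Call plt.show() or fig.savefig(...) to view the result.
```

Unlike the `pd.cut` approach, this keeps the size mapping continuous and lets seaborn build the size legend itself.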