I'm exploring a dataset and plotting its variables. I'm using this dataset: https://archive.ics.uci.edu/ml/datasets/automobile. I want to plot city-mpg and highway-mpg against num-of-cylinders.
My code in R:
library(ggplot2)
data <- read.csv('http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data', header=F, sep = "," ,dec = ".",
colClasses = c('factor','numeric','factor','factor','factor','factor','factor','factor','factor',
'numeric','numeric','numeric','numeric','numeric','factor','factor','numeric',
'factor','numeric','numeric','numeric','numeric','numeric','numeric',
'numeric','numeric'), na.strings = "?")
colnames(data) <- c("symboling", "normalized-losses","make","fuel-type","aspiration",
"num-of-doors","body-style","drive-wheels","engine-location","wheel-base","length",
"width","height","curb-weight","engine-type","num-of-cylinders","engine-size","fuel-system",
"bore","stroke","compression-ratio","horsepower","peak-rpm","city-mpg","highway-mpg","price")
summary(data)
data$`num-of-cylinders` <- as.character(data$`num-of-cylinders`)
data$`num-of-cylinders`[which(data$`num-of-cylinders` == "two")] <- "2"
data$`num-of-cylinders`[which(data$`num-of-cylinders` == "three")] <- "3"
data$`num-of-cylinders`[which(data$`num-of-cylinders` == "four")] <- "4"
data$`num-of-cylinders`[which(data$`num-of-cylinders` == "five")] <- "5"
data$`num-of-cylinders`[which(data$`num-of-cylinders` == "six")] <- "6"
data$`num-of-cylinders`[which(data$`num-of-cylinders` == "eight")] <- "8"
data$`num-of-cylinders`[which(data$`num-of-cylinders` == "twelve")] <- "12"
data$`num-of-cylinders` <- as.numeric(data$`num-of-cylinders`)
data$`num-of-cylinders` <- as.factor(data$`num-of-cylinders`)
ggplot(data = data, aes(x = `num-of-cylinders`, y = `city-mpg`)) +
geom_boxplot() +
xlab('Number of Cylinders') +
ylab('MPG') +
ggtitle('MPG Comparison by Number of Cylinders')
ggplot(data = data, aes(x = `num-of-cylinders`, y = `highway-mpg`)) +
geom_boxplot() +
xlab('Number of Cylinders') +
ylab('MPG') +
ggtitle('MPG Comparison by Number of Cylinders')
I can draw the two boxplots separately, but is there a way to show both city-mpg and highway-mpg on the same y axis?
Answer 0 (score: 2)
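First, a suggestion: use underscores (_) instead of hyphens (-) in column names, because you can then write data$city_mpg, whereas data$city-mpg is parsed as a subtraction unless the name is wrapped in backticks.
Second, reshape the data to long format with tidyr::gather() and map the new variable to color:
data_long <- tidyr::gather(data, key = measure, value = mpg, `city-mpg`, `highway-mpg`)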
ggplot(data_long, aes(x = `num-of-cylinders`, y = mpg, color = measure)) +
geom_boxplot() +
xlab('Number of Cylinders') +
ylab('MPG') +
ggtitle('MPG Comparison by Number of Cylinders and Road Type')
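ggplot generally expects data in long format rather than wide. Think about what you're trying to do: compare mpg against number of cylinders, grouped by condition (city vs. highway). Reshaping to long format turns city vs. highway into a single variable, which you can map to color or split into facets.
So there are two options you can add to your code: one using color (above) and one using facets (below).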
ggplot(data_long, aes(x = `num-of-cylinders`, y = mpg)) +
geom_boxplot() +
xlab('Number of Cylinders') +
ylab('MPG') +
ggtitle('MPG Comparison by Number of Cylinders and Road Type') +
facet_wrap(~measure)
Created on 2018-04-10 by the reprex package (v0.2.0).
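As a side note, tidyr::gather() is marked as superseded in tidyr 1.0.0 and later, and pivot_longer() does the same reshape there. The following is a minimal sketch, assuming a recent tidyr is installed; the toy data frame and its values are made up for illustration and only stand in for the three relevant columns of the automobile data.

```r
library(tidyr)

# Hypothetical stand-in for the automobile data (values are invented)
toy <- data.frame(
  `num-of-cylinders` = factor(c(4, 4, 6, 8)),
  `city-mpg`         = c(24, 27, 19, 16),
  `highway-mpg`      = c(30, 34, 25, 22),
  check.names = FALSE  # keep the hyphenated names as they are
)

# Stack the two mpg columns into a single 'mpg' column;
# 'measure' records which original column each value came from
data_long <- pivot_longer(
  toy,
  cols      = c(`city-mpg`, `highway-mpg`),
  names_to  = "measure",
  values_to = "mpg"
)
```

The resulting data_long has one row per car and measure, so it should be usable in the same ggplot calls shown in this answer, with toy replaced by the real data.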
Answer 1 (score: 1)
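As with most analytical procedures, ggplot works best with long data. Simply stack the city_mpg subset on top of the highway_mpg subset and add an indicator column for city vs. highway. The code below uses underscores rather than hyphens in the column names.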
# RBIND TWO DATAFRAME SUBSETS (RENAMING W/ setNames AND ADDING NEW COLUMN W/ transform)
long_data <- rbind(transform(setNames(data[c("num_of_cylinders", "city_mpg")],
c("num_of_cylinders", "mpg")), mile_type = "city"),
transform(setNames(data[c("num_of_cylinders", "highway_mpg")],
c("num_of_cylinders", "mpg")), mile_type = "highway"))
# PLOT LONG DATA
ggplot(data = long_data, aes(x = num_of_cylinders, y = mpg, colour=mile_type)) +
geom_boxplot() +
xlab('Number of Cylinders') +
ylab('MPG') +
ggtitle('MPG Comparison by Number of Cylinders')
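The code in this answer assumes the columns have already been renamed with underscores, which the question's code does not do. One possible way to get there is a gsub() over the names, sketched here on a made-up toy data frame (the values are invented; only the column names mirror the dataset).

```r
# Hypothetical stand-in with the original hyphenated column names
toy <- data.frame(
  `num-of-cylinders` = c(4, 6),
  `city-mpg`         = c(24, 19),
  `highway-mpg`      = c(30, 25),
  check.names = FALSE  # keep the hyphens so there is something to rename
)

# Replace every hyphen in the column names with an underscore
names(toy) <- gsub("-", "_", names(toy), fixed = TRUE)
```

Applied to the question's data frame, the same line would be names(data) <- gsub("-", "_", names(data), fixed = TRUE), run after the colnames() assignment.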