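Out-of-order dates in ggplot2

Posted: 2016-10-31 16:04:57

Tags: r sorting date ggplot2

I generally know how to order dates in ggplot, but this data is different, and I'm hoping someone can clarify it for me.

Consider: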

ggplot(tmp3)+
geom_boxplot(aes(x=simdte,y=r2))+
facet_wrap(~simyr, scales='free_x')+
theme(axis.text.x=element_text(angle=45,hjust=1))

The dates are in alphanumeric order, but now I want to format the x-axis labels, so I tried:

ggplot(tmp3)+
geom_boxplot(aes(x=reorder(strftime(strptime(simdte,'%Y%m%d'),'%b-%d'),as.numeric(simdte)),y=r2))+
facet_wrap(~simyr, scales='free_x')+
theme(axis.text.x=element_text(angle=45,hjust=1))

But notice that the dates are now out of order at June 8, 2015.

I also tried

tmp3 <- tmp3 %>%
  mutate(plotsimdte = factor(
    strftime(strptime(simdte, '%Y%m%d'), '%b-%d'),
    levels = strftime(strptime(unique(simdte), '%Y%m%d'), '%b-%d')[order(unique(simdte))]))

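and plotted with x=plotsimdte, but it made no difference. When I create the factor I get a warning about duplicated levels, which is confusing because I only used unique values.

Finally, I tried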

ggplot(tmp3)+
geom_boxplot(aes(x=as.Date(simdte,'%Y%m%d'),y=r2, group=simdte))+
scale_x_date(date_labels ='%b-%d')+
facet_wrap(~simyr, scales='free_x')+
theme(axis.text.x=element_text(angle=45,hjust=1))

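But I want to keep the dates discrete, because they matter as identifiers rather than as positions along a time axis.

Any suggestions would be much appreciated. Thanks.

A small subset of my data

Edit: updated the dput output using as.data.frame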

> dput(as.data.frame(tmp3))
structure(list(mdldte = c("20130525", "20140407", "20140413", 
"20150608", "20130525", "20150608", "20140420", "20130429", "20130608", 
"20130608", "20140323", "20140413", "20150325", "20150608", "20140511", 
"20130601", "20150608", "20130608", "20140420", "20150305", "20150415", 
"20130608", "20140531", "20150608", "20140531", "20150608", "20130403", 
"20130503", "20150415", "20140407", "20150608", "20140323", "20130525", 
"20140420", "20130403", "20130403", "20130608", "20150501", "20150608", 
"20130429", "20160607", "20140527", "20140420", "20140531", "20140502", 
"20150325", "20140428", "20160620", "20160620", "20130403", "20160527", 
"20150415", "20140413", "20160607", "20140413", "20150608", "20160613", 
"20150608", "20140407", "20150501", "20140323", "20160607", "20140531", 
"20150305", "20150409", "20140428", "20130503", "20130525", "20140428", 
"20140407", "20130503", "20130525", "20130403", "20150305", "20150217", 
"20150501", "20130608", "20150305", "20150217", "20130608", "20140511", 
"20160527", "20140502", "20150415"), simdte = c("20130403", "20130403", 
"20130403", "20130429", "20130429", "20130429", "20130503", "20130503", 
"20130503", "20130525", "20130525", "20130525", "20130601", "20130601", 
"20130601", "20130608", "20130608", "20130608", "20140323", "20140323", 
"20140323", "20140407", "20140407", "20140407", "20140413", "20140413", 
"20140413", "20140420", "20140420", "20140420", "20140428", "20140428", 
"20140428", "20140502", "20140502", "20140502", "20140511", "20140511", 
"20140511", "20140517", "20140517", "20140517", "20140527", "20140527", 
"20140527", "20140531", "20140531", "20140531", "20150217", "20150217", 
"20150217", "20150305", "20150305", "20150305", "20150325", "20150325", 
"20150325", "20150409", "20150409", "20150409", "20150415", "20150415", 
"20150415", "20150427", "20150427", "20150427", "20150501", "20150501", 
"20150501", "20150608", "20150608", "20150608", "20160527", "20160527", 
"20160527", "20160607", "20160607", "20160607", "20160613", "20160613", 
"20160613", "20160620", "20160620", "20160620"), r2 = c(0.862283742909527, 
0.813142444594872, 0.700946018367384, 0.474388980021752, 0.826648311592866, 
0.794283339648572, 0.79687922855493, 0.808984929407683, 0.781751354268809, 
0.535951689307516, 0.68524477567256, 0.716321630808227, 0.373141090466726, 
0.723850452026657, 0.408972539926536, 0.29346057127035, 0.319261073048776, 
0.319535158994707, 0.872351278607699, 0.871652058666136, 0.509872096326808, 
0.398605136979609, 0.420745998256184, 0.596082529689281, 0.793035779455997, 
0.661212720614186, 0.736581215438551, 0.89337362408349, 0.900773593767951, 
0.916946297262156, 0.700865150846107, 0.839501961957186, 0.863684601286204, 
0.819367869015135, 0.765192251153536, 0.590744027549224, 0.720092636591613, 
0.732237645665246, 0.701898569000057, 0.505310296599101, 0.756344530560126, 
0.522404606955389, 0.631453896947287, 0.732767696833121, 0.669168785479052, 
0.340080390313005, 0.397681954572616, 0.708286400101956, 0.551718623201008, 
0.62217661847446, 0.160935876745664, 0.79407487647674, 0.729924604817696, 
0.716024523586796, 0.526169199415047, 0.702098331814224, 0.748626603557805, 
0.432690018453805, 0.710646849035047, 0.526049259906931, 0.811336120223548, 
0.679819505156441, 0.591396577448379, 0.656686513355743, 0.698313842140892, 
0.718604690738853, 0.768070041705958, 0.453336001102217, 0.544446423520199, 
0.583336140040845, 0.172961846412558, 0.298155303932666, 0.731010397306203, 
0.582517045429492, 0.521708072638302, 0.610885761462162, 0.543494236386099, 
0.630580819311437, 0.642714888852003, 0.736302041771047, 0.736086951074143, 
0.444437396681972, 0.445336147280364, 0.43829690520584), simyr = c("2013", 
"2013", "2013", "2013", "2013", "2013", "2013", "2013", "2013", 
"2013", "2013", "2013", "2013", "2013", "2013", "2013", "2013", 
"2013", "2014", "2014", "2014", "2014", "2014", "2014", "2014", 
"2014", "2014", "2014", "2014", "2014", "2014", "2014", "2014", 
"2014", "2014", "2014", "2014", "2014", "2014", "2014", "2014", 
"2014", "2014", "2014", "2014", "2014", "2014", "2014", "2015", 
"2015", "2015", "2015", "2015", "2015", "2015", "2015", "2015", 
"2015", "2015", "2015", "2015", "2015", "2015", "2015", "2015", 
"2015", "2015", "2015", "2015", "2015", "2015", "2015", "2016", 
"2016", "2016", "2016", "2016", "2016", "2016", "2016", "2016", 
"2016", "2016", "2016"), mdlpreds = structure(c(4L, 2L, 3L, 1L, 
3L, 2L, 4L, 2L, 3L, 3L, 4L, 2L, 1L, 2L, 3L, 1L, 3L, 3L, 4L, 4L, 
1L, 1L, 1L, 3L, 2L, 3L, 3L, 4L, 4L, 4L, 2L, 3L, 4L, 2L, 4L, 1L, 
3L, 3L, 3L, 3L, 2L, 1L, 4L, 2L, 4L, 3L, 1L, 4L, 4L, 4L, 3L, 4L, 
2L, 2L, 1L, 3L, 3L, 1L, 3L, 2L, 2L, 3L, 3L, 4L, 4L, 3L, 2L, 1L, 
3L, 2L, 3L, 1L, 2L, 1L, 3L, 1L, 1L, 3L, 2L, 2L, 2L, 1L, 1L, 1L
), .Label = c("phv", "phvfsca", "phvaso", "phvasofsca"), class = "factor")), class = "data.frame", .Names = c("mdldte", 
"simdte", "r2", "simyr", "mdlpreds"), row.names = c(NA, -84L))

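1 answer:

Answer 0 (score: 4)

The problem is that your dates are currently being interpreted as character data, and R is ordering them as such. What you really want is to treat them as true Date objects and then let ggplot's higher-level functions handle ordering and labeling accordingly.

Convert the date data to the Date type: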

tmp3$newdate <- as.Date(strptime(tmp3$simdte, '%Y%m%d'))
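As a quick check of that conversion, here is a minimal sketch on a few values taken from the question's simdte column. The vector name simdte_sample is made up for illustration. as.Date() also accepts the format string directly, so the strptime() step is optional.

```r
# A handful of simdte values from the question, deliberately shuffled
simdte_sample <- c("20150608", "20130403", "20140527", "20130608")

# Same conversion as in the answer, plus the shorter equivalent
via_strptime <- as.Date(strptime(simdte_sample, "%Y%m%d"))
direct       <- as.Date(simdte_sample, format = "%Y%m%d")

all(via_strptime == direct)  # both routes should agree
class(direct)                # "Date"
sort(direct)                 # sorts chronologically, not as text
```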

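Assign the new date as the x value (no need to select only unique values) and use scale_x_date to create nice labels. Note that this also spaces the data points correctly across time, instead of using even spacing for each "level" of the date data.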

plot.new <- ggplot(tmp3)+
    geom_point(aes(x= newdate, y=r2))+
    scale_x_date(date_labels = '%b-%d') +
    facet_wrap(~simyr, scales='free_x')+
    theme(axis.text.x=element_text(angle=45,hjust=1))
print(plot.new)


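For the future, it is useful to know the str function, which quickly tells you the format of your data columns (also accessible from RStudio's Environment pane):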

str(tmp3)

'data.frame':   28 obs. of  7 variables:
 $ mdldte  : chr  "20150305" "20140531" "20160620" "20150305" ...
 $ simdte  : chr  "20130403" "20130429" "20130503" "20130525" ...
 $ r2      : num  0.542 0.485 0.54 0.4 0.594 ...
 $ simyr   : chr  "2013" "2013" "2013" "2013" ...
 $ mdlyr   : chr  "2015" "2014" "2016" "2015" ...
 $ mdlpreds: Factor w/ 4 levels "phv","phvfsca",..: 1 1 1 1 4 1 4 2 3 4 ...
 $ newdate : Date, format: "2013-04-03" "2013-04-29" "2013-05-03" "2013-05-25" ...

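As you can see, your original "simdte" column is stored as character data. R (and ggplot) will treat each distinct value as a separate level or category. By contrast, Date data is fundamentally numeric. R treats it as continuous, which makes it easier to plot precisely along a timeline or axis. It also makes it easier to keep the underlying data separate from the formatting of any plot labels.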
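To make the difference concrete, here is a small sketch with two values from the question. A Date is stored as a number of days since 1970-01-01, so arithmetic works on it, whereas a character vector only supports string comparison. The names chr and dte are illustrative.

```r
chr <- c("20130403", "20130429")
dte <- as.Date(chr, format = "%Y%m%d")

is.character(chr)     # TRUE: each distinct string becomes its own category
as.numeric(dte)       # days since 1970-01-01
diff(dte)             # Time difference of 26 days

# Labels are a formatting choice applied at display time
format(dte, "%d/%m")  # "03/04" "29/04"
```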

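Update: using dates as categories and drawing boxplots, in date order

If we want to treat each date as a category (rather than treating date data as numeric distances), the solution is actually simpler. Weird things happen when you try to change the number of values going into a ggplot aesthetic, and I suspect that is the root cause of your problem.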
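One way to see what might be going wrong, offered as a guess rather than a confirmed diagnosis: once the year is dropped, some dates in the sample data share a label (June 8 occurs in 2013 and 2015, May 27 in 2014 and 2016). That would also explain the duplicate-level warning. The sketch below uses four simdte values from the question; the variable names are made up for illustration.

```r
simdte_sample <- c("20130608", "20140527", "20150608", "20160527")
lab <- strftime(strptime(simdte_sample, "%Y%m%d"), "%b-%d")

# Four distinct dates collapse to two labels
anyDuplicated(lab) > 0   # TRUE
length(unique(lab))      # 2

# reorder() sorts levels by the mean (its default FUN) of the second
# argument within each label, so a shared label lands at the average
# of the dates that map to it
f <- reorder(lab, as.numeric(simdte_sample))
attr(f, "scores")
```

With the full data, the score for the shared June 8 label is 20140608, which is lower than every other 2015 value, so that label would sort first in the 2015 facet.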

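The key is to rely on ggplot's built-in labeling features. Once again, the main ggplot call uses the original data, and scale_x_discrete handles the creation of nice labels: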

plot.new <- ggplot(tmp3)+
    geom_boxplot(aes(x=simdte,y=r2))+
    facet_wrap(~simyr, scales='free_x')+
    scale_x_discrete(labels = function(x) strftime(strptime(x, '%Y%m%d'), '%b-%d'))+
    theme(axis.text.x=element_text(angle=45,hjust=1))
print(plot.new)
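The labels function can be tried on its own before plotting. In this sketch, fmt_label is a hypothetical name for the same anonymous function used above. The scale passes it the break values (the original simdte strings) and it returns only the display text, so the ordering is still determined by the full yyyymmdd strings.

```r
fmt_label <- function(x) strftime(strptime(x, "%Y%m%d"), "%b-%d")

# For yyyymmdd strings, character sort order matches date order
breaks <- sort(c("20150608", "20150217", "20150501"))
breaks

# Month abbreviations depend on the session locale
fmt_label(breaks)
```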
