Problem with the density calculation in R's cdplot()

Asked: 2016-06-24 14:46:38

Tags: r density-plot

(Not sure whether this question belongs on CrossValidated or Stack Overflow)

A subset of my data:

mdat1 <- structure(list(Name = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Bilbao", 
"San Sebastian", "Vitoria"), class = "factor"), PrecipTotal = c(0, 
1.01600203200406, 0, 6.09601219202438, 73.4061468122936, 4.31800863601727, 
0, 0.254000508001016, 7.8740157480315, 5.58801117602235, 0, 0, 
0, 0, 2.03200406400813, 0, 0.254000508001016, 0, 2.03200406400813, 
0, 0, 0, 57.9121158242316, 1.77800355600711, 0, 0.762001524003048, 
6.3500127000254, 0, 0, 1.27000254000508, 8.89001778003556, 1.01600203200406, 
0, 0, 0, 0, 0.762001524003048, 0, 8.89001778003556, 0, 0, 21.8440436880874, 
0, 0.508001016002032, 0, 0.508001016002032, 0.508001016002032, 
0, 0, 0, 14.4780289560579, 0.254000508001016, 0.508001016002032, 
0, 23.3680467360935, 6.09601219202438, 0, 0, 0, 0, 28.1940563881128, 
0, 0, 0, 3.04800609601219, 0, 0, 0, 0, 6.09601219202438, 0, 2.03200406400813, 
0, 4.06400812801626, 0, 0.508001016002032, 0, 0, 0.508001016002032, 
7.11201422402845, 34.0360680721361, 0, 0, 0, 7.8740157480315, 
0, 4.06400812801626, 0, 0, 0.508001016002032, 5.08001016002032, 
7.11201422402845, 7.11201422402845, 0, 0, 0, 1.01600203200406, 
0, 0, 0), Hail = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 
2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Hail", 
"NoHail"), class = "factor")), .Names = c("Name", "PrecipTotal", 
"Hail"), row.names = c(43878L, 33821L, 40681L, 35121L, 45112L, 
46428L, 45844L, 43199L, 34440L, 43184L, 32850L, 39220L, 38416L, 
33860L, 34867L, 32737L, 43232L, 31772L, 35850L, 38894L, 39289L, 
33148L, 32159L, 43197L, 43962L, 45068L, 41848L, 35929L, 34842L, 
42069L, 39503L, 31747L, 43286L, 34919L, 43925L, 45368L, 42489L, 
41686L, 43194L, 34747L, 37001L, 42923L, 45006L, 46170L, 33191L, 
34392L, 44047L, 35859L, 42159L, 38843L, 45860L, 34180L, 33846L, 
42810L, 46160L, 33523L, 34840L, 40226L, 42868L, 43576L, 46570L, 
39980L, 42453L, 42063L, 38121L, 32822L, 40670L, 32859L, 46228L, 
40239L, 32420L, 38874L, 39638L, 39523L, 31765L, 32753L, 33752L, 
35574L, 36263L, 32871L, 32539L, 38455L, 41119L, 45124L, 34560L, 
34144L, 41461L, 41449L, 35499L, 42783L, 34106L, 38151L, 36313L, 
46593L, 39973L, 43928L, 35240L, 43626L, 46195L, 44388L), class = "data.frame")

Using the following code

cdplot(mdat1 [, 2], mdat1 [, 3], ylab = "", main = "1", 
                 xlab = "", 
                 col = c("purple", "gray"))

produces the messy cdplot() output labeled "1". A different sample of the original data produces the output labeled "2".

http://i63.tinypic.com/2d00yfo.png

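I think it has something to do with the distribution of the x values? If they are extremely skewed (as in "1"), does the density calculation run into trouble?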
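One way to probe this suspicion is to look at the bandwidth that density() picks by default (cdplot() relies on density() for its estimates). The sketch below uses synthetic data as a hypothetical stand-in for PrecipTotal (many exact zeros plus a long right tail); the variable names and the 1e-4 threshold are illustrative assumptions, not part of the original question.

```r
set.seed(1)
# Hypothetical stand-in for PrecipTotal: many exact zeros plus a long right tail
x <- c(rep(0, 55), rexp(45, rate = 1 / 8))

# Rule-of-thumb bandwidth that density() uses when bw is not supplied
bw_default <- bw.nrd0(x)
range_x <- diff(range(x))

# Compare the bandwidth with the spread of the data
c(bandwidth = bw_default, range = range_x, ratio = bw_default / range_x)

# Share of the evaluation grid where the estimated density of x is tiny
# (the 1e-4 cutoff is arbitrary and only for illustration)
d <- density(x)
frac_sparse <- mean(d$y < 1e-4)
frac_sparse
```

If the bandwidth is small relative to the range, the estimated density of x drops close to zero between isolated tail observations, and any quantity that divides by it can become unstable there.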

http://i65.tinypic.com/20rqtg2.png

3 Answers:

Answer 0: (score: 2)

Here is how it looks when I adjust the bw parameter without modifying the data, so I would say just use the bw parameter.

cdplot(mdat1 [, 2], mdat1 [, 3], ylab = "", 
              xlab = "", 
              col = c("purple", "gray"), bw = 1)


cdplot(mdat1 [, 2], mdat1 [, 3], ylab = "", 
              xlab = "", 
              col = c("purple", "gray"), bw = 2)

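To compare bandwidths numerically rather than by eye, cdplot() can be called with plot = FALSE, in which case it returns the estimated conditional density functions without drawing. The following is a minimal sketch on synthetic data (a hypothetical stand-in for mdat1; the variable names are assumptions for illustration only).

```r
set.seed(42)
n <- 200
# Hypothetical data: skewed predictor and a factor with a rare "Hail" level
x <- c(rep(0, 110), rexp(90, rate = 1 / 8))
y <- factor(rep(c("Hail", rep("NoHail", 9)), length.out = n),
            levels = c("Hail", "NoHail"))

# Estimate without plotting; each result is a list of functions
cd_default <- cdplot(x, y, plot = FALSE)          # default bandwidth
cd_bw2     <- cdplot(x, y, bw = 2, plot = FALSE)  # fixed bandwidth of 2

# Evaluate the first conditional density (first factor level) on a common grid
grid <- seq(0, max(x), length.out = 200)
p_default <- cd_default[[1]](grid)
p_bw2     <- cd_bw2[[1]](grid)

# Inspect the spread of each curve; large jumps between neighbouring
# grid points hint at regions with very few x observations
range(p_default, na.rm = TRUE)
range(p_bw2, na.rm = TRUE)
```

Plotting p_default and p_bw2 against grid with lines() gives a direct overlay of the two estimates.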

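Answer 1: (score: 2)

I'd say this is just a bug, although you were warned, rather vaguely, when the help page says "the conditional densities are more reliable for high-density regions of x". Contrast all of these efforts with what you get from lattice's densityplot. (In my opinion, much clearer and more informative.) The cdplot and ggplot efforts seem to distort the data severely.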

library(lattice)
densityplot(~PrecipTotal, groups=Hail, mdat1, col = c("purple", "gray"))

You can contrast that display of the data with the less pathological-looking output you get from:

cdplot(Hail ~ PrecipTotal, data=mdat1, bw=2)

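...but this still leaves you with the impression that there is a large difference in density between the two groups in the 45-65 region, whereas the side-by-side display shows a gap in one group and a single point in the other, which seems more easily explained by random variation.

There is a fine point here: by lattice convention, separate panels result from a formula specification that includes a conditioning variable, whereas grouping with the groups= mechanism draws the groups in the same plot area.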
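The distinction between the two lattice conventions can be sketched as follows. The data frame d is synthetic and purely hypothetical; only the densityplot() calls mirror the usage discussed in this answer.

```r
library(lattice)

set.seed(3)
# Hypothetical stand-in for mdat1
d <- data.frame(
  PrecipTotal = c(rep(0, 50), rexp(50, rate = 1 / 8)),
  Hail = factor(rep(c("Hail", "NoHail", "NoHail", "NoHail", "NoHail"), 20))
)

# groups=: both densities are overlaid in a single panel
p_grouped <- densityplot(~ PrecipTotal, groups = Hail, data = d,
                         auto.key = TRUE)

# Conditioning variable after "|": one panel per level of Hail
p_panels <- densityplot(~ PrecipTotal | Hail, data = d)

# lattice returns "trellis" objects; call print(p_grouped) or
# print(p_panels) to draw them, for example inside a script
```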

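Answer 2: (score: 1)

I think you may want to consider transforming the PrecipTotal variable first and then creating the conditional density plot. After a bit of tinkering, it seems that taking the sqrt of the variable is enough. We may also need to adjust the bandwidth to get a better-looking plot.

Obviously, these transformations and adjustments mean we have to be very careful about how we interpret the relationship.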
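One way to keep the interpretation anchored to the original units is to estimate on the square-root scale but evaluate the result at rainfall amounts on the original scale. This is only a sketch under assumptions: precip and hail are synthetic stand-ins for the columns of mdat1, and the evaluation points in mm are arbitrary.

```r
set.seed(7)
# Hypothetical stand-in for mdat1: skewed rainfall and a two-level factor
precip <- c(rep(0, 60), rexp(60, rate = 1 / 8))
hail <- factor(rep(c("Hail", "NoHail", "NoHail", "NoHail"), length.out = 120),
               levels = c("Hail", "NoHail"))

# Estimate on the square-root scale without drawing anything
cd_sqrt <- cdplot(sqrt(precip), hail, plot = FALSE)

# Evaluate at rainfall amounts given on the original scale:
# a position s on the sqrt axis corresponds to s^2 in original units
mm <- c(0, 1, 4, 9, 16, 25)
p_first_level <- cd_sqrt[[1]](sqrt(mm))

data.frame(PrecipTotal = mm, sqrt_scale = sqrt(mm), estimate = p_first_level)
```

Because the smoothing happens on the transformed scale, these estimates will generally differ from those obtained on the raw scale, which is part of why the interpretation needs care.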

Using base R's cdplot

cdplot(Hail ~ sqrt(PrecipTotal), data = mdat1)

ggplot2, using geom_density with position = 'fill'

library(ggplot2)
ggplot(mdat1, aes(sqrt(PrecipTotal)))+
    geom_density(aes(fill = Hail), position = 'fill')+
    theme_bw()

ggplot2 with a few options

ggplot(mdat1, aes(sqrt(PrecipTotal)))+
    geom_density(aes(fill = Hail), position = 'fill',
                 kernel = 'cosine', adjust = 1.1)+
    theme_bw()

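As far as I know, geom_density passes kernel and adjust on to stats::density(), so their effect can be checked in base R. The sketch below uses synthetic data (a hypothetical stand-in for sqrt(PrecipTotal)) and shows that adjust multiplies the selected bandwidth, while kernel changes only the shape of the smoothing window.

```r
set.seed(11)
# Hypothetical stand-in for sqrt(PrecipTotal)
x <- sqrt(c(rep(0, 55), rexp(45, rate = 1 / 8)))

d_default <- density(x)                                   # gaussian kernel, adjust = 1
d_tuned   <- density(x, kernel = "cosine", adjust = 1.1)  # options from the plot above

# adjust = 1.1 scales the rule-of-thumb bandwidth by 10 percent
c(default = d_default$bw, tuned = d_tuned$bw)
```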