A clean way to plot densities from HDF5 in R

Asked: 2018-05-15 21:52:22

Tags: r ggplot2

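I have an hdf5 file with 24 columns and many rows. Each row is one observation. Of the 24 columns, 22 contain variables, 1 contains the target value describing the "truth value" of the observation, and 1 contains the weight of that data point.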

I would like to plot the density of each variable so that I can compare the distributions between truth values.

Example

Let's use this slightly simpler setup for illustration:

example_data <- c(rnorm(20, 0, 0.5), rnorm(20, 1, 0.5), abs(rnorm(20, 0.5, 0.5)), sample(0:2, 20, replace = TRUE))
data_mat <- matrix(example_data, nrow=20, ncol=4)
colnames(data_mat) <- c("cute.variable", "fuzzy.variable", "weight", "target")

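In practice I get my data from the hdf5 file (with h5read), which gives me a matrix. I then read the column names from a separate text file, because h5read seems to ignore that data.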
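A minimal sketch of the column-naming step, assuming the names file holds one column name per line. The file name and the stand-in matrix are hypothetical; in practice the matrix would come from h5read.

```r
# Stand-in for the matrix returned by h5read(); the real call is not shown here.
set.seed(1)
data_mat <- matrix(rnorm(8), nrow = 2, ncol = 4)

# Hypothetical names file with one column name per line.
names_file <- tempfile(fileext = ".txt")
writeLines(c("cute.variable", "fuzzy.variable", "weight", "target"), names_file)

# Read the names and check that they match the number of columns before assigning.
col_names <- readLines(names_file)
stopifnot(length(col_names) == ncol(data_mat))
colnames(data_mat) <- col_names
```

The length check guards against the names file and the hdf5 data drifting out of sync when variables are added or removed.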

Then, to plot the density of each variable, split by target value, I do this:

library(ggplot2)
library(reshape)
# weights
w_0_long = melt(data_mat[which(data_mat[,'target']==0), "weight"])
w_1_long = melt(data_mat[which(data_mat[,'target']==1), "weight"])
w_2_long = melt(data_mat[which(data_mat[,'target']==2), "weight"])

for(name in colnames(data_mat)){
  if(name == "target") next
  if(name == "weight") next
  # raw data
  var_0_long = melt(data_mat[which(data_mat[,'target']==0), name])
  var_1_long = melt(data_mat[which(data_mat[,'target']==1), name])
  var_2_long = melt(data_mat[which(data_mat[,'target']==2), name])

  raw_plot <- ggplot() + geom_density(aes(value), colour="red", data=var_0_long) + 
    geom_density(aes(value), colour="blue", data=var_1_long)+ 
    geom_density(aes(value), colour="green", data=var_2_long)
  print(raw_plot)

  readline(prompt="Press [enter] to continue")
  # weighted data
  weighted_plot <- ggplot() + geom_density(aes(value, weight=w_0_long), colour="red", data=var_0_long) + 
    geom_density(aes(value, weight=w_1_long), colour="blue", data=var_1_long)+ 
    geom_density(aes(value, weight=w_2_long), colour="green", data=var_2_long)
  print(weighted_plot)

  readline(prompt="Press [enter] to continue")
}

Question

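Surely there is a better way to plot densities from hdf5 files? Perhaps there is a way to convert the matrix to a data frame at the start, but I can't seem to do that without manually adding all 22 variables, and I would rather not hard-code them because they may change. Also, the number of variables for each target is different, so at some point the data would still need to be split by target.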
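One possible way to avoid hard-coding the variable names, sketched in base R: convert the whole matrix with as.data.frame() and treat every column other than "weight" and "target" as a variable. The names `df`, `var_names`, and `long_df` are illustrative, and the assumption that all remaining columns are variables may not hold for every file.

```r
set.seed(47)
data_mat <- matrix(
  c(rnorm(20, 0, 0.5), rnorm(20, 1, 0.5),
    abs(rnorm(20, 0.5, 0.5)), sample(0:2, 20, replace = TRUE)),
  nrow = 20, ncol = 4,
  dimnames = list(NULL, c("cute.variable", "fuzzy.variable", "weight", "target"))
)

# Convert the whole matrix at once; the column names carry over.
df <- as.data.frame(data_mat)

# Assumption: every column that is not "weight" or "target" is a variable.
var_names <- setdiff(colnames(df), c("weight", "target"))

# Stack the variable columns into long form, repeating weight and target
# once per variable so each value keeps its own weight and target.
long_df <- data.frame(
  variable = rep(var_names, each = nrow(df)),
  value    = unlist(df[var_names], use.names = FALSE),
  weight   = rep(df$weight, times = length(var_names)),
  target   = factor(rep(df$target, times = length(var_names)))
)
```

A data frame in this shape can be handed to ggplot with `aes(value, weight = weight, colour = target)` and `facet_wrap(~variable)`, so no manual split by target is needed.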

I think I need to use ggplot, because it can compute weighted density plots.

1 Answer:

Answer 0: (score: 1)

You can split by target and put everything on one plot:

library(tidyverse)
set.seed(47)

# generate data
matrix(c(rnorm(20, 0, 0.5), 
         rnorm(20, 1, 0.5), 
         abs(rnorm(20, 0.5, 0.5)), 
         sample(0:2, 20, replace = TRUE)), 
       # dimensions
       nrow = 20, 
       ncol = 4, 
       # set column names
       dimnames = list(NULL, c("cute.variable", "fuzzy.variable", "weight", "target"))) %>% 
    # coerce to data frame
    as.data.frame() %>% 
    # reshape to long form
    gather(variable, value, contains('variable')) %>% 
    # plot, coercing `target` to factor so it's discrete
    ggplot(aes(value, weight = weight, color = factor(target), fill = factor(target))) + 
    geom_density(alpha = 0.3) + 
    # separate facets by `variable`
    facet_wrap(~variable)
#> Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
#> = kernel, : sum(weights) != 1 -- will not get true density

Note the warning: the weights do not sum to 1, which may or may not be a problem for your purposes.
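If the warning does matter, one option is to rescale the weights so that they sum to 1 within each density that gets drawn, that is, within each combination of variable and target. A minimal sketch in base R, assuming a long-form data frame with the columns `variable`, `value`, `weight`, and `target`; the names `long_df` and `weight_norm` are hypothetical.

```r
set.seed(47)
long_df <- data.frame(
  variable = rep(c("cute.variable", "fuzzy.variable"), each = 20),
  value    = c(rnorm(20, 0, 0.5), rnorm(20, 1, 0.5)),
  weight   = rep(abs(rnorm(20, 0.5, 0.5)), times = 2),
  target   = factor(rep(sample(0:2, 20, replace = TRUE), times = 2))
)

# Rescale the weights within each variable/target group so each group sums to 1.
long_df$weight_norm <- ave(
  long_df$weight, long_df$variable, long_df$target,
  FUN = function(w) w / sum(w)
)

# Sanity check: sum of the normalised weights per group that actually occurs.
group_sums <- aggregate(weight_norm ~ variable + target, data = long_df, FUN = sum)
```

Mapping `weight = weight_norm` in `aes()` instead of the raw weights would then give each curve weights that sum to 1. Whether that is the right normalisation depends on what the weights mean in the analysis.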