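I have an hdf5 file with 24 columns and many rows. Each row is one observation. Of the 24 columns, 22 contain variables, 1 contains the target value describing the "truth" of the observation, and 1 contains the weight of that data point.
I want to plot the density of each variable so that I can compare the distributions across truth values.
Example
Let's use this slightly simpler setup for illustration: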
example_data <- c(rnorm(20, 0, 0.5), rnorm(20, 1, 0.5), abs(rnorm(20, 0.5, 0.5)), sample(0:2, 20, replace=T))
data_mat <- matrix(example_data, nrow=20, ncol=4)
colnames(data_mat) <- c("cute.variable", "fuzzy.variable", "weight", "target")
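In reality I get my data from the hdf5 file (with h5read), which gives me a matrix. I then read the column names from a separate text file, because h5read seems to ignore that information.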
Then, to plot the density function of each variable, split by target value, I do this:
library(ggplot2)
library(reshape)
# weights
w_0_long = melt(data_mat[which(data_mat[,'target']==0), "weight"])
w_1_long = melt(data_mat[which(data_mat[,'target']==1), "weight"])
w_2_long = melt(data_mat[which(data_mat[,'target']==2), "weight"])
for(name in colnames(data_mat)){
    if(name == "target") next
    if(name == "weight") next
    # raw data
    var_0_long = melt(data_mat[which(data_mat[,'target']==0), name])
    var_1_long = melt(data_mat[which(data_mat[,'target']==1), name])
    var_2_long = melt(data_mat[which(data_mat[,'target']==2), name])
    raw_plot <- ggplot() + geom_density(aes(value), colour="red", data=var_0_long) +
        geom_density(aes(value), colour="blue", data=var_1_long)+
        geom_density(aes(value), colour="green", data=var_2_long)
    print(raw_plot)
    readline(prompt="Press [enter] to continue")
    # weighted data
    weighted_plot <- ggplot() + geom_density(aes(value, weight=w_0_long), colour="red", data=var_0_long) +
        geom_density(aes(value, weight=w_1_long), colour="blue", data=var_1_long)+
        geom_density(aes(value, weight=w_2_long), colour="green", data=var_2_long)
    print(weighted_plot)
    readline(prompt="Press [enter] to continue")
}
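Question
Surely there is a better way to plot densities from hdf5 files? Perhaps the matrix could be converted to a data frame at the start, but I can't seem to do that without manually adding all 22 variables, and I would rather not hardcode them because they may change. Also, the number of variables differs for each target, so at some point the data still has to be split by target.
I think it needs to be ggplot, because that will compute the weighted density plots.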
Answer 0 (score: 1)
You can reshape the data to long form, color by target, and put everything on a single plot with one facet per variable:
library(tidyverse)
set.seed(47)
# generate data
matrix(c(rnorm(20, 0, 0.5),
         rnorm(20, 1, 0.5),
         abs(rnorm(20, 0.5, 0.5)),
         sample(0:2, 20, replace = TRUE)),
       # dimensions
       nrow = 20,
       ncol = 4,
       # set column names
       dimnames = list(NULL, c("cute.variable", "fuzzy.variable", "weight", "target"))) %>%
    # coerce to data frame
    as.data.frame() %>%
    # reshape to long form
    gather(variable, value, contains('variable')) %>%
    # plot, coercing `target` to factor so it's discrete
    ggplot(aes(value, weight = weight, color = factor(target), fill = factor(target))) +
    geom_density(alpha = 0.3) +
    # separate facets by `variable`
    facet_wrap(~variable)
#> Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel
#> = kernel, : sum(weights) != 1 -- will not get true density
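Note the warning, which may or may not be a problem. It comes from density(): the weights passed to it do not sum to 1 within each group, so the resulting curves are not normalized densities.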
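If the sum(weights) != 1 warning matters for your use case, one option is to rescale the weights so they sum to 1 within each curve before plotting. The sketch below is an illustration, not part of the original answer. It assumes every column other than weight and target is a variable, which avoids relying on the column names containing "variable", and it uses pivot_longer from tidyr in place of gather. Newer ggplot2 releases may already normalize the weights internally, in which case the warning does not appear and the rescaling is harmless.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(47)
# same toy data as above
data_mat <- matrix(c(rnorm(20, 0, 0.5),
                     rnorm(20, 1, 0.5),
                     abs(rnorm(20, 0.5, 0.5)),
                     sample(0:2, 20, replace = TRUE)),
                   nrow = 20, ncol = 4,
                   dimnames = list(NULL, c("cute.variable", "fuzzy.variable",
                                           "weight", "target")))

long_df <- as.data.frame(data_mat) %>%
  # assumption: everything except weight and target is a variable column
  pivot_longer(cols = -c(weight, target),
               names_to = "variable", values_to = "value") %>%
  group_by(variable, target) %>%
  # rescale so the weights sum to 1 within each variable/target curve
  mutate(weight = weight / sum(weight)) %>%
  ungroup()

p <- ggplot(long_df, aes(value, weight = weight,
                         color = factor(target), fill = factor(target))) +
  geom_density(alpha = 0.3) +
  facet_wrap(~variable)
print(p)
```

Because the weights are rescaled per variable and target, each curve integrates to 1, so the curves compare shapes rather than the total weight carried by each target.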