Quickly interpolating missing values in an R plot

Time: 2018-12-05 11:01:32

Tags: r dataframe ggplot2

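I would like an efficient way to plot a data frame with missing values as a line plot in R, following these principles:

  • NAs at the first and last values are omitted entirely (no line, no point)
  • NAs between actual values are replaced by interpolated values for the line plot (no point is shown)

Here is an example of my data frame (edited):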

df <- data.frame("time" = c(1,2,3,4,5),
             "case1" = c(NA,2,3,4,NA),
             "case2" = c(5,4,3,2,NA),
             "case3" = c(4,NA,NA,NA,2))

This is how it works for the first case only:

library(pracma)
df$case1.i <- with(df, interp1(time, case1, time, 'linear'))
library(ggplot2)
ggplot(df, aes(time)) + geom_point(aes(y = case1)) + geom_line(aes(y = case1.i))

I am trying to work out something that applies this to the roughly 200 columns in my real data frame. So far, this code does not seem to work:

for (i in colnames(df)){
  argument <- paste("df$case",i,".i <- with(df, interp1(time, case",i,", time, 'linear'))")
  eval(parse(text=argument))
}

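3 answers:

Answer 0 (score: 1)

Read the data into a new zoo object z, apply na.approx to it to fill in the NA values in the interior of the data, and then plot with ggplot2. If you want separate panels, omit facet = NULL. Note that fortify.zoo with melt = TRUE converts the data to long format with Index, Series, and Value columns, which are then used in geom_point. If you only need the lines, omit the geom_point(...) part. See the image at the end of this answer. The approach shown here is relatively compact and avoids pasting code together and then evaluating it.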

library(ggplot2)
library(zoo)

z <- read.zoo(df)
autoplot(na.approx(z), facet = NULL) + 
  geom_point(aes(Index, Value, group = Series), fortify(z, melt = TRUE))
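
To make the interpolation step easier to picture, here is a minimal sketch of na.approx on a plain numeric vector rather than on the zoo object z. The vector and the name `filled` are made up for illustration, and `na.rm = FALSE` is an addition of this sketch (it is not in the code above); it keeps the output the same length as the input so the result is easy to compare position by position.

```r
library(zoo)

# Hypothetical series: NAs at both ends and one NA inside the data
case1 <- c(NA, 2, NA, 4, NA)

# Linear interpolation fills the interior gap. With na.rm = FALSE the
# leading and trailing NAs are kept instead of being trimmed off.
filled <- na.approx(case1, na.rm = FALSE)
print(filled)
```

The interior gap is filled from its neighbours, while the NAs at either end stay NA, which matches the two requirements in the question.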

Or, if you want a separate plot for each column, try this:

pdf("civy.pdf")

for(i in 1:ncol(z)) {
  p <- autoplot(na.approx(z[, i])) + 
    ylab(names(z)[i]) +
    geom_point(aes(Index, Value), fortify(z[, i], melt = TRUE))
  plot(p)
}

dev.off()

[Image: screenshot]

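Answer 1 (score: 1)

Here are two solutions: one plots all the data together, distinguished by colour; the other plots each case separately in its own facet. The principle is basically the same in both: I use approx for linear interpolation, reshape the data from wide to long format so that it is easier to plot with ggplot2, and then plot. In the second solution I also create a new variable called type to distinguish the interpolated data from the raw data.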
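
As a small illustration of the interpolation step on its own, the sketch below calls base R's approx on two of the example columns outside of any pipeline. The names `case1_interp` and `case3_interp` are chosen here only for illustration.

```r
time  <- c(1, 2, 3, 4, 5)
case1 <- c(NA, 2, 3, 4, NA)
case3 <- c(1, NA, NA, NA, 5)

# approx() ignores the NA pairs and interpolates linearly between the
# remaining points. With the default rule = 1, any xout outside the range
# of the non-missing data comes back as NA.
case1_interp <- approx(time, case1, xout = time)$y
case3_interp <- approx(time, case3, xout = time)$y

print(case1_interp)
print(case3_interp)
```

For case1 the first and last positions stay NA because they lie outside the observed range, while the three interior NAs of case3 are filled along the straight line between its two endpoints.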

Plotted together

library(magrittr)  # %<>%
library(dplyr)     # mutate_at, filter
library(tidyr)     # gather
library(ggplot2)

# Create data frame
df <- data.frame("time" = c(1,2,3,4,5),
                 "case1" = c(NA,2,3,4,NA),
                 "case2" = c(1,2,3,4,NA),
                 "case3" = c(1,NA,NA,NA,5)) 

# Perform interpolation on all columns (adds case1_interp, case2_interp, ...)
# Switch from wide to long format
# list(interp = ~ ...) is used in place of funs(), which is deprecated since dplyr 0.8.0
df %<>% 
  mutate_at(vars(contains("case")), list(interp = ~ approx(time, ., xout = time)$y)) %>% 
  gather(var, val, -time)

# Plot results all in one figure: points for raw data, lines for interpolated data
g <- ggplot() 
g <- g + geom_point(data = df %>% filter(!grepl("interp", var)), aes(x = time, y = val, colour = var))
g <- g + geom_line(data = df %>% filter(grepl("interp", var)), aes(x = time, y = val, colour = var))
print(g)

Plotted separately

# Assumes magrittr, dplyr, tidyr and ggplot2 are loaded

# Create data frame
df <- data.frame("time" = c(1,2,3,4,5),
                 "case1" = c(NA,2,3,4,NA),
                 "case2" = c(1,2,3,4,NA),
                 "case3" = c(1,NA,NA,NA,5)) 

# Perform interpolation on all columns
# Switch from wide to long format
# Create column to indicate whether raw or interpolated
# Strip "_interp" from var
# list(interp = ~ ...) is used in place of funs(), which is deprecated since dplyr 0.8.0
df %<>% 
  mutate_at(vars(contains("case")), list(interp = ~ approx(time, ., xout = time)$y)) %>% 
  gather(var, val, -time) %>% 
  mutate(type = ifelse(grepl("interp", var), "interp", "raw"),
         var = gsub("_.*", "", var))

# Plot each case in its own facet
g <- ggplot() 
g <- g + geom_point(data = df %>% filter(type == "raw"), aes(x = time, y = val))
g <- g + geom_line(data = df %>% filter(type == "interp"), aes(x = time, y = val))
g <- g + facet_grid(var ~.)
print(g)



Edit using the new data frame

df <- data.frame("time" = c(1,2,3,4,5),
                 "case1" = c(NA,2,3,4,NA),
                 "case2" = c(5,4,3,2,NA),
                 "case3" = c(4,NA,NA,NA,2))

# list(interp = ~ ...) is used in place of funs(), which is deprecated since dplyr 0.8.0
df %<>% 
  mutate_at(vars(contains("case")), list(interp = ~ approx(time, ., xout = time)$y)) %>% 
  gather(var, val, -time) %>% 
  mutate(type = ifelse(grepl("interp", var), "interp", "raw"),
         var = gsub("_.*", "", var))

# Points for raw data, lines for interpolated data, one colour per case
g <- ggplot() 
g <- g + geom_point(data = df %>% filter(type == "raw"), aes(x = time, y = val, colour = var))
g <- g + geom_line(data = df %>% filter(type == "interp"), aes(x = time, y = val, colour = var))
print(g)


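Answer 2 (score: 1)

You were on the right track, although there are a few mistakes in how you paste together the argument to be evaluated. Off the top of my head:

  • You should use paste0() to get rid of the spaces
  • You are iterating over the column names but using i as if it were a number
  • I would iterate only over the columns you want to interpolate rather than over all columns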
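
The first two points can be seen by printing the string that the original loop builds for a single column name. This is only a diagnostic sketch; the names `broken`, `fixed`, `broken_parses` and `fixed_parses` are introduced here for illustration, and nothing is evaluated, so pracma is not needed to run it.

```r
i <- "case1"  # the loop variable holds a column name, not a number

# paste() separates its arguments with spaces, and "case" is prepended
# to a name that already starts with "case"
broken <- paste("df$case", i, ".i <- with(df, interp1(time, case", i, ", time, 'linear'))")
print(broken)

# paste0() adds no separator, and the column name is used as-is
fixed <- paste0("df$", i, "_i <- with(df, interp1(time, ", i, ", time, 'linear'))")
print(fixed)

# parse() only checks the syntax here; it does not run the code
broken_parses <- !inherits(try(parse(text = broken), silent = TRUE), "try-error")
fixed_parses  <- !inherits(try(parse(text = fixed), silent = TRUE), "try-error")
print(c(broken = broken_parses, fixed = fixed_parses))
```

Only the paste0() version is syntactically valid R, which is why the original loop fails at the eval(parse(...)) step.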

Here is the code with the changes I mentioned above:

library(pracma)   # interp1
library(ggplot2)

# Interpolate every column except time
cols_to_interpolate <- setdiff(colnames(df), 'time')

for (col in cols_to_interpolate){
  #print(col)
  argument <- paste0("df$", col,"_i <- with(df, interp1(time, ", col,", time , 'linear'))")
  #print(argument)
  eval(parse(text=argument))
}

# Add one point layer (raw values) and one line layer (interpolated values) per column
p <- ggplot(df, aes(x = time))
for (col in cols_to_interpolate){
  p <- p + 
    geom_point(aes_string(y = col, color = shQuote(col)), na.rm = TRUE) + 
    geom_line(aes_string(y = paste0(col,"_i"), color = shQuote(col)), na.rm = TRUE)
}
p + ylab('Y Label') + xlab('X Label')


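Note: I chose this approach because it is the closest to what you were trying to do, but I am sure there are many more efficient ways to get the final result. (Reducing the loops would, of course, be a plus.)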
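
One possible variation on the interpolation loop, sketched here under stated assumptions, is to assign each new column by name with `[[<-` so that no code string has to be built and evaluated. Base R's approx is swapped in for pracma's interp1 purely to keep the sketch free of extra packages; this is a substitution made for the sketch, not what the answer above uses.

```r
df <- data.frame(time  = c(1, 2, 3, 4, 5),
                 case1 = c(NA, 2, 3, 4, NA),
                 case2 = c(5, 4, 3, 2, NA),
                 case3 = c(4, NA, NA, NA, 2))

cols_to_interpolate <- setdiff(colnames(df), "time")

# df[[name]] <- value creates or replaces a column by name, so the
# eval(parse(text = ...)) step is not needed.
# approx() stands in for interp1() here (an assumption of this sketch).
for (col in cols_to_interpolate) {
  df[[paste0(col, "_i")]] <- approx(df$time, df[[col]], xout = df$time)$y
}

print(df)
```

The resulting case1_i, case2_i and case3_i columns follow the same naming as in the answer, so the plotting loop there should work on this data frame unchanged.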