Merging long-format data containing NAs with complete wide-format data to overwrite the NAs

Time: 2017-04-20 05:28:43

Tags: r merge melt

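I need to merge three datasets. They contain school data with reading and math scores for grades 4 and 5. One of them is a long-format dataset with many missing values in some variables (yes, I do need the data in long format), and the other two are wide-format datasets that contain the missing values. All of these data frames have a column with a unique ID number for each individual in the database.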

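Here is a fully reproducible example that generates a small version of the kind of data frames I am working with. The three data frames I need to use are school_lf, school4, and school5. school_lf contains the long-format data with NAs; school4 and school5 are the data frames I need to use to fill in the NAs in that long-format data (matching on id and grade).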

library(reshape2)  # provides melt()

set.seed(890)
school <- NULL
school$id <- sample(102938:999999, 100)
school$selected <- sample(0:1, 100, replace = TRUE)
school$math4 <- sample(400:500, 100)
school$math5 <- sample(400:500, 100)
school$read4 <- sample(400:500, 100)
school$read5 <- sample(400:500, 100)
school <- as.data.frame(school)

# Delete observations at random from the school df
indm4 <- which(school$math4 %in% sample(school$math4, 25))
school$math4[indm4] <- NA
indm5 <- which(school$math5 %in% sample(school$math5, 50))
school$math5[indm5] <- NA
indr4 <- which(school$read4 %in% sample(school$read4, 70))
school$read4[indr4] <- NA
indr5 <- which(school$read5 %in% sample(school$read5, 81))
school$read5[indr5] <- NA

# Separate Read and Math
read <- as.data.frame(subset(school, select = -c(math4, math5)))
math <- as.data.frame(subset(school, select = -c(read4, read5)))

# Now turn this into long form data...
clr <- melt(read, id.vars = c("id", "selected"), variable.name = "variable", value.name = "readscore")
clm <- melt(math, id.vars = c("id", "selected"), value.name = "mathscore")

# Clean up the grades for each of these...
clr$grade <- ifelse(clr$variable == "read4", 4, ifelse(clr$variable == "read5", 5, NA))
clm$grade <- ifelse(clm$variable == "math4", 4, ifelse(clm$variable == "math5", 5, NA))

# Put all these in one df
school_lf <- cbind(clm, readscore = clr$readscore)  # add the read scores under a clean name
school_lf$variable <- NULL                          # drop the melt label column

###############
# Generate the 2 data frames with IDs that have the full data
set.seed(890)
school4 <- NULL
school4$id <- sample(102938:999999, 100)
school4$selected <- sample(0:1, 100, replace = TRUE)
school4$math4 <- sample(400:500, 100)
school4$read4 <- sample(400:500, 100)
school4$grade <- 4
school4 <- as.data.frame(school4)

set.seed(890)
school5 <- NULL
school5$id <- sample(102938:999999, 100)
school5$selected <- sample(0:1, 100, replace = TRUE)
school5$math5 <- sample(400:500, 100)
school5$read5 <- sample(400:500, 100)
school5$grade <- 5
school5 <- as.data.frame(school5)

I need to merge the wide-format data into the long-format data so that the NAs are replaced with actual values. I have tried merging them, but the attempt introduces several extra columns instead of filling in the read and math scores where there are NAs. I only need one column with read scores and one column with math scores, not six separate columns (including read.x, read.y, math.x, and math.y).

Any help is greatly appreciated! I have been trying to solve this for a few hours without making any progress, so I figured I would ask here.

2 Answers:

Answer 0: (score: 0)

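You can use the coalesce function from dplyr. If the value in the first vector is NA, it checks whether the value at the same position in the second vector is non-NA and selects it. If that one is NA as well, it moves on to the third.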
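To make the position-by-position behaviour concrete, here is a minimal sketch on three made-up vectors (the names x, y, and z and their values are illustrative assumptions, not data from the question):

```r
library(dplyr)

# Three hypothetical vectors of equal length (illustrative values only)
x <- c(1, NA, NA, 4)
y <- c(NA, 2, NA, 40)
z <- c(NA, NA, 3, 400)

# For each position, take the first non-NA value, checking x, then y, then z
filled <- coalesce(x, y, z)
filled
```

Where x already has a value (positions 1 and 4), that value is kept even though y or z also hold one.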

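library(dplyr)
# sch: the long-format data with the wide-format columns
# (math4, math5, read4, read5) already joined on
sch %>% mutate(mathscore = coalesce(mathscore, math4, math5)) %>%
   mutate(readscore = coalesce(readscore, read4, read5)) %>%
   select(id:readscore)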
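The snippet uses sch without showing how it is built. A plausible reading, and only an assumption here, is that sch is the long data left-joined to each wide frame on id, selected, and grade. The sketch below uses tiny hypothetical stand-ins for school_lf, school4, and school5 (same column names as in the question, made-up values) so it runs on its own:

```r
library(dplyr)

# Hypothetical miniature versions of the question's data frames
school_lf <- data.frame(
  id        = c(1, 2, 1, 2),
  selected  = c(0, 1, 0, 1),
  mathscore = c(410, NA, NA, 440),
  grade     = c(4, 4, 5, 5),
  readscore = c(NA, 420, 450, NA)
)
school4 <- data.frame(id = c(1, 2), selected = c(0, 1),
                      math4 = c(410, 415), read4 = c(405, 420), grade = 4)
school5 <- data.frame(id = c(1, 2), selected = c(0, 1),
                      math5 = c(430, 440), read5 = c(450, 460), grade = 5)

# Assumption: `sch` is the long data left-joined to each wide frame, so that
# math4/read4 are filled on grade-4 rows and math5/read5 on grade-5 rows
keys <- c("id", "selected", "grade")
sch <- school_lf %>%
  left_join(school4, by = keys) %>%
  left_join(school5, by = keys)

# Fill each NA from whichever wide column has a value, then drop the helpers
filled <- sch %>%
  mutate(mathscore = coalesce(mathscore, math4, math5),
         readscore = coalesce(readscore, read4, read5)) %>%
  select(id:readscore)
filled
```

Because the joins match on grade as well as id, a grade-5 row never picks up a grade-4 score, and left_join keeps every row of the long data even when a wide frame has no matching id.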

Answer 1: (score: 0)

Edit: I just tried this approach on my actual data and it did not work, because the replacement data also have some NAs, and so the data frames I was trying to run coalesce on had different numbers of rows... back to square one.

I was eventually able to solve this, although my code is not the most elegant or direct; @Edwin's response helped point me in the right direction. Any suggestions for making it more elegant and efficient are very welcome!
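One way to cope with replacement frames that have a different number of rows, and NAs of their own, is to look values up by id with base R's match() instead of lining vectors up by position. The following is only a sketch under assumptions, not the code this answer used: the helper fill_from_wide and the miniature data are hypothetical, and only the math score is shown.

```r
# Hypothetical miniature data: the replacement frames have fewer rows than
# the long data, and school4 still contains an NA of its own
school_lf <- data.frame(
  id        = c(1, 2, 3, 1, 2, 3),
  grade     = c(4, 4, 4, 5, 5, 5),
  mathscore = c(NA, 415, NA, 430, NA, NA)
)
school4 <- data.frame(id = c(1, 3), math4 = c(410, NA))
school5 <- data.frame(id = c(2, 3), math5 = c(440, 445))

# Hypothetical helper: fill NAs in `score` for one grade by looking ids up in
# a wide frame; match() returns NA for ids that are absent from the wide frame
fill_from_wide <- function(long, wide, grade, score, wide_col) {
  rows <- which(long$grade == grade & is.na(long[[score]]))
  long[[score]][rows] <- wide[[wide_col]][match(long$id[rows], wide$id)]
  long
}

school_lf <- fill_from_wide(school_lf, school4, 4, "mathscore", "math4")
school_lf <- fill_from_wide(school_lf, school5, 5, "mathscore", "math5")
school_lf$mathscore
```

Only rows that are NA for the given grade are touched, so existing scores are never overwritten, and a score that is missing in both sources simply stays NA.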