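Merging three variables that contain redundant information

Date: 2017-03-03 00:34:46

Tags: r

After some data wrangling and merging on my dataset, I ended up with three variables that hold the same information, as in this dummy example: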

cond.x <- c("1","2", "3","4",NA, "4", "1")
cond.y <- c("1", NA, "3",  NA,  "1", "4", NA)
dx <- c("scz", "cont", "siscz", "sicon", "scz", NA,NA)

mydata <-data.frame(cond.x, cond.y, dx)
> mydata
  cond.x cond.y    dx
1      1      1   scz
2      2   <NA>  cont
3      3      3 siscz
4      4   <NA> sicon
5   <NA>      1   scz
6      4      4  <NA>
7      1   <NA>  <NA>

So 1 means scz, 2 means cont, 3 means siscz, and 4 means sicon.

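  • How can I remove this redundancy and create a single variable that holds the information, filling in each NA with the value from one of the other two variables?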

3 answers:

Answer 0 (score: 1)

Convert dx to a factor and store its levels in level_dx. Then convert all three columns of mydata to integer type.

mydata$dx <- factor(mydata$dx, levels = c("scz", "cont", "siscz", "sicon"))
level_dx <- levels(mydata$dx)
mydata[, 1:2] <- lapply(mydata[, 1:2], function(x) as.integer(as.character(x)) )
mydata$dx <- as.integer(mydata$dx)
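
The detour through as.character matters when the columns are factors (the default for data.frame() before R 4.0.0): calling as.integer directly on a factor returns the internal level codes, not the labels. A small sketch with a made-up vector f, which is not part of the original data:

```r
# Hypothetical factor whose labels are numeric strings
f <- factor(c("4", "1", "3"))

# Direct conversion returns the positions in levels(f), i.e. "1" "3" "4"
codes_direct <- as.integer(f)

# Going through character returns the labels themselves as numbers
codes_via_chr <- as.integer(as.character(f))

print(codes_direct)   # 3 1 2
print(codes_via_chr)  # 4 1 3
```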

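Use the fill function from the tidyr package to fill missing values in each column from the neighbouring value, first upwards and then downwards. Because fill works along columns, the data is transposed first so that each original row becomes a column. Finally, convert the dx column back to a factor.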

library('tidyr')
mydata <- fill( data.frame(t(mydata)), 1:7, .direction = 'up')
mydata <- data.frame( t( fill( mydata, 1:7, .direction = 'down') ) )
mydata$dx <- factor( mydata$dx, levels = sort(unique( mydata$dx )), labels = level_dx)
#    cond.x cond.y    dx
# X1      1      1   scz
# X2      2      2  cont
# X3      3      3 siscz
# X4      4      4 sicon
# X5      1      1   scz
# X6      4      4 sicon
# X7      1      1   scz
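
The same row-wise idea can be sketched in base R without transposing: take the first non-missing code in each row and map it back to its label. This is an alternative sketch, not the method used above. It assumes the three columns never disagree where they are present, and the names codes, first_non_na, and merged are made up for illustration.

```r
# Integer-coded version of the example data (dx coded 1 = scz, ..., 4 = sicon)
codes <- data.frame(
  cond.x = c(1L, 2L, 3L, 4L, NA, 4L, 1L),
  cond.y = c(1L, NA, 3L, NA, 1L, 4L, NA),
  dx     = c(1L, 2L, 3L, 4L, 1L, NA, NA)
)

# Return the first non-missing value of a row (NA if the whole row is missing)
first_non_na <- function(row) row[!is.na(row)][1]
code <- unname(apply(codes, 1, first_non_na))

# Map the integer codes back to the labels
level_dx <- c("scz", "cont", "siscz", "sicon")
merged <- factor(code, levels = seq_along(level_dx), labels = level_dx)
print(merged)
```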

Data:

cond.x <- c("1","2", "3","4",NA, "4", "1")
cond.y <- c("1", NA, "3",  NA,  "1", "4", NA)
dx <- c("scz", "cont", "siscz", "sicon", "scz", NA,NA)

mydata <-data.frame(cond.x, cond.y, dx)
mydata
#   cond.x cond.y    dx
# 1      1      1   scz
# 2      2   <NA>  cont
# 3      3      3 siscz
# 4      4   <NA> sicon
# 5   <NA>      1   scz
# 6      4      4  <NA>
# 7      1   <NA>  <NA>

Answer 1 (score: 1)

A somewhat shorter version, mainly thanks to the data.table package:

x <- c("1","2", "3","4",NA, "4", "1")
y <- c("1", NA, "3",  NA,  "1", "4", NA)
dx <- c("scz", "cont", "siscz", "sicon", "scz", NA,NA)

mydata <- data.frame(x, y, dx, stringsAsFactors = FALSE)

library(data.table)
library(magrittr)  # provides %>%, used to build the lookup table below
# Convert to data.table by reference
setDT(mydata)

# Merge x and y into xy: take x, falling back to y where x is NA
mydata[, xy := fcoalesce(x, y)][]
# Create lookup table (one row per known dx label)
lookup <- mydata[, .(xy = first(xy)), by = dx] %>% na.omit() %>% setnames(c('dx_l', 'xy'))
# Join mydata with lookup using xy
mydata[lookup, dy := dx_l, on = c(xy = 'xy')][]

mydata[, .(dy)]
#       dy
# 1:   scz
# 2:  cont
# 3: siscz
# 4: sicon
# 5:   scz
# 6: sicon
# 7:   scz
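
For readers without data.table, the lookup-and-join step can be approximated in base R with match. This is a sketch of the same idea rather than a translation of the code above. The name known is made up for illustration, and stringsAsFactors = FALSE is set explicitly so the behaviour is the same before and after R 4.0.0.

```r
x  <- c("1", "2", "3", "4", NA, "4", "1")
y  <- c("1", NA, "3", NA, "1", "4", NA)
dx <- c("scz", "cont", "siscz", "sicon", "scz", NA, NA)

# Merged code per row: x where available, otherwise y
xy <- ifelse(is.na(x), y, x)

# Lookup table built only from rows where both the code and the label are known
known  <- !is.na(dx) & !is.na(xy)
lookup <- unique(data.frame(xy = xy[known], dx_l = dx[known],
                            stringsAsFactors = FALSE))

# "Join": find each row's code in the lookup table and pull the label
dy <- lookup$dx_l[match(xy, lookup$xy)]
print(dy)
```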

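Answer 2 (score: 1)

We can do this with coalesce from dplyr: take the first non-NA entry from 'cond.x' and 'cond.y', then use it as an index to update the values in 'dx'. The codes are converted to integers first, because indexing with character strings would look up names rather than positions. This works here because the first four rows of 'dx' happen to be in code order (1 = scz, 2 = cont, 3 = siscz, 4 = sicon).

library(tidyverse)
mydata %>% 
      mutate(dx = dx[as.integer(coalesce(as.character(cond.x), as.character(cond.y)))])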
#  cond.x cond.y    dx
#1      1      1   scz
#2      2   <NA>  cont
#3      3      3 siscz
#4      4   <NA> sicon
#5   <NA>      1   scz
#6      4      4 sicon
#7      1   <NA>   scz
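
Indexing dx by position depends on the first four rows of dx being in code order. A sketch that does not depend on row order, assuming the mapping stated in the question (1 = scz, 2 = cont, 3 = siscz, 4 = sicon), uses a named lookup vector in base R. The names labels_by_code and dx_full are made up for illustration.

```r
cond.x <- c("1", "2", "3", "4", NA, "4", "1")
cond.y <- c("1", NA, "3", NA, "1", "4", NA)

# Explicit code-to-label mapping, taken from the question text
labels_by_code <- c("1" = "scz", "2" = "cont", "3" = "siscz", "4" = "sicon")

# First non-NA code per row, then look the label up by name
code    <- ifelse(is.na(cond.x), cond.y, cond.x)
dx_full <- unname(labels_by_code[code])
print(dx_full)
```

Rows where both cond.x and cond.y are NA would still need the original dx value as a fallback; the example data has no such row.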