Select non-NA values and assign variables based on column names

Time: 2018-05-22 08:15:09

Tags: r dplyr na

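I have been given a dataset in which each participant completed 7 trials in one of 16 possible conditions. The 16 conditions come from a 2x2x2x2 design (i.e., there are four manipulated variables, each with two levels). Let's say Var1 has the levels 'Pot' and 'Pan', Var2 has the levels 'Hi' and 'Low', Var3 has the levels 'Up' and 'Down', and Var4 has the levels 'One' and 'Two'.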

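The dataset includes a column for every observation in every condition for each participant. That is, each row has 112 (16 * 7) columns (plus a few columns with demographics and so on), 105 (15 * 7) of which are empty. The condition is encoded in the column label, so the columns range from 'PotHiUpOne1' to 'PanLowDownTwo7'.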

The data look like this:

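library(dplyr)

Var1 <- c('Pot', 'Pan')
Var2 <- c('Hi', 'Low')
Var3 <- c('Up', 'Down')
Var4 <- c('One','Two')
Obs <- seq(1,7,1)

# One row per condition/observation pair; expand.grid names its columns Var1..Var5
df <- expand.grid(Var1,Var2,Var3,Var4,Obs)
df <- df %>% 
  arrange(Var1,Var2,Var3,Var4)

# Column names such as "PotHiUpOne1"
x <- apply(df,1,paste,collapse="")

id <- seq(1,16,1)
age <- rep(20,16)
df <- as.data.frame(cbind(id, age))

# Add one empty (NA) column per condition/observation name
for (i in seq_along(x)) {
  df[,ncol(df)+1] <- NA
  names(df)[ncol(df)] <- x[i]
}

# Each participant has values only in the 7 columns of their own condition
j <- seq(3,ncol(df),7)

for (i in 1:nrow(df)) {
    df[i,c(j[i]:(j[i]+6))] <- 10
}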

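I want to tidy this data frame so that each row has 4 columns (one per variable) specifying the condition, and 7 columns holding the observations.

My solution was to filter the data with dplyr, like this: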

Df1 <- df %>% 
  filter(!is.na(PotHiUpOne1)) %>% 
  mutate(Var1 = 'pot', Var2 = 'hi', Var3 = 'up', Var4 = 'one')

Then drop the columns that are entirely NA:

Df1 <- Filter(function(x)!all(is.na(x)), Df1)

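I did this 16 times (once per condition). Then, after renaming the seven remaining observation columns so that they matched, I bound the 16 resulting data frames together.

I would like to know whether anyone can suggest a more efficient way of doing this, preferably using dplyr.

Edit: when I say "efficient" I mean a more elegant way of writing the code, rather than something that runs fast (the dataset isn't large); i.e., something that doesn't involve writing out more or less the same block of code 16 times.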

2 Answers:

Answer 0: (score: 2)

Hope this is what you want:

library(data.table)

dtt <- as.data.table(df)
dtt2 <-  melt(dtt, id.vars = c('id', 'age'))[!is.na(value)]
dtt2[, c('var1', 'var2', 'var3', 'var4', 'cond') := tstrsplit(variable, '(?!^)(?=[A-Z0-9])', perl = T)]
dtt2[, variable := NULL]
dcast(dtt2, ... ~ cond, value.var = 'value')
#     id age var1 var2 var3 var4  1  2  3  4  5  6  7
#  1:  1  20  Pot   Hi   Up  One 10 10 10 10 10 10 10
#  2:  2  20  Pot   Hi   Up  Two 10 10 10 10 10 10 10
#  3:  3  20  Pot   Hi Down  One 10 10 10 10 10 10 10
#  4:  4  20  Pot   Hi Down  Two 10 10 10 10 10 10 10
#  5:  5  20  Pot  Low   Up  One 10 10 10 10 10 10 10
#  6:  6  20  Pot  Low   Up  Two 10 10 10 10 10 10 10
#  7:  7  20  Pot  Low Down  One 10 10 10 10 10 10 10
#  8:  8  20  Pot  Low Down  Two 10 10 10 10 10 10 10
#  9:  9  20  Pan   Hi   Up  One 10 10 10 10 10 10 10
# 10: 10  20  Pan   Hi   Up  Two 10 10 10 10 10 10 10
# 11: 11  20  Pan   Hi Down  One 10 10 10 10 10 10 10
# 12: 12  20  Pan   Hi Down  Two 10 10 10 10 10 10 10
# 13: 13  20  Pan  Low   Up  One 10 10 10 10 10 10 10
# 14: 14  20  Pan  Low   Up  Two 10 10 10 10 10 10 10
# 15: 15  20  Pan  Low Down  One 10 10 10 10 10 10 10
# 16: 16  20  Pan  Low Down  Two 10 10 10 10 10 10 10
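
For readers who would rather stay in the tidyverse, as the question asks, the same melt/split/cast idea can probably be expressed with tidyr. What follows is only a sketch under a few assumptions: it needs tidyr 1.0.0 or later (for pivot_longer() and pivot_wider()), it rebuilds the question's layout under the name wide, and the names wide, tidy, and the obs prefix on the observation columns are my own choices rather than anything from the thread.

```r
library(dplyr)
library(tidyr)

# Rebuild the question's wide layout (assumed naming scheme: "PotHiUpOne1").
# expand.grid varies its first argument fastest, so Var4 is listed first.
conds <- expand.grid(Var4 = c("One", "Two"), Var3 = c("Up", "Down"),
                     Var2 = c("Hi", "Low"), Var1 = c("Pot", "Pan"),
                     stringsAsFactors = FALSE)
cond_names <- with(conds, paste0(Var1, Var2, Var3, Var4))
col_names <- as.vector(t(outer(cond_names, 1:7, paste0)))

wide <- data.frame(id = 1:16, age = 20)
wide[col_names] <- NA_real_
for (i in 1:16) {
  # Participant i has data only in the 7 columns of condition i
  wide[i, 2 + (i - 1) * 7 + 1:7] <- 10
}

tidy <- wide %>%
  # Long format: one row per non-NA cell, with the column name split into
  # the four condition levels and the observation number
  pivot_longer(
    cols = -c(id, age),
    names_to = c("var1", "var2", "var3", "var4", "obs"),
    names_pattern = "^([A-Z][a-z]+)([A-Z][a-z]+)([A-Z][a-z]+)([A-Z][a-z]+)([0-9]+)$",
    values_drop_na = TRUE
  ) %>%
  # Back to one row per participant, one column per observation
  pivot_wider(names_from = obs, values_from = value, names_prefix = "obs")
```

The regular expression relies on every level starting with a capital letter followed by lowercase letters, which holds for the labels in the question but would need adjusting for other naming schemes.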

Answer 1: (score: 0)

OK, this isn't as clean as mt1022's solution, but it doesn't need data.table. The case_when function requires dplyr; everything else is base R.

Define two new functions, find_conditions and transform.

find_conditions is a bit clunky but might be useful, since you can easily add new definitions if needed.

library(dplyr) # needed for case_when

find_conditions <- function(x){
  x1 <- x
  x1 <- case_when(
    x1 == "PotHiUpOne" ~ c("pot", "hi", "up", "one"),
    x1 == "PotHiUpTwo" ~ c("pot", "hi", "up", "two"),
    x1 == "PotHiDownOne" ~ c("pot", "hi", "down", "one"),
    x1 == "PotHiDownTwo" ~ c("pot", "hi", "down", "two"),
    x1 == "PotLowUpOne" ~ c("pot", "low", "up", "one"),
    x1 == "PotLowUpTwo" ~ c("pot", "low", "up", "two"),
    x1 == "PotLowDownOne" ~ c("pot", "low", "down", "one"),
    x1 == "PotLowDownTwo" ~ c("pot", "low", "down", "two"),
    x1 == "PanHiUpOne" ~ c("pan", "hi", "up", "one"),
    x1 == "PanHiUpTwo" ~ c("pan", "hi", "up", "two"),
    x1 == "PanHiDownOne" ~ c("pan", "hi", "down", "one"),
    x1 == "PanHiDownTwo" ~ c("pan", "hi", "down", "two"),
    x1 == "PanLowUpOne" ~ c("pan", "low", "up", "one"),
    x1 == "PanLowUpTwo" ~ c("pan", "low", "up", "two"),
    x1 == "PanLowDownOne" ~ c("pan", "low", "down", "one"),
    x1 == "PanLowDownTwo" ~ c("pan", "low", "down", "two")
  )
  if(NA %in% x1){
    cat("Error: Input not recognized")
  } else{
    return(x1)
  }
}

transform takes a row and converts it into the format we want. It depends on the find_conditions function we have already defined.

# Note: this name masks base R's transform()
transform <- function(row){
  row1 <- row[3:length(row)] # Forget about id and age columns, will put them back at the end

  cols <- colnames(row1)[!is.na(row1)] # Get names of the columns which are not NA
  cols <- substr(cols,1,nchar(cols)-1) # Slice off the last character (The number)
  cols <- cols[!duplicated(cols)] # Columns should all have the same name now - find it by removing duplicates
  vars <- find_conditions(cols) # Use our new find_conditions function to break it up into individual conditions

  row1 <- row1[!is.na(row1)] # Keep only non-NA values

  new_row <- c(row[1:2],row1,vars) # put id, age, row1, vars together
  as.vector(unlist(new_row)) # Return as an unnamed vector
}

Now using these two functions is very simple. I left it in a loop, since you said it's not a big dataset. Good luck!
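
The loop this answer refers to is not shown here. As a rough illustration of what a row-by-row loop of this kind could look like, here is a self-contained sketch in base R. It is not the answerer's code: wide is a made-up toy data frame with two conditions and three observations each, split_condition is a hypothetical replacement for the hand-written lookup table (it splits the label at each capital letter), and tidy_row is a hypothetical stand-in for the role transform plays.

```r
# Toy wide data (assumed layout, mirroring the question's naming scheme)
wide <- data.frame(
  id = 1:2, age = c(20, 20),
  PotHiUpOne1 = c(10, NA), PotHiUpOne2 = c(11, NA), PotHiUpOne3 = c(12, NA),
  PanLowDownTwo1 = c(NA, 7), PanLowDownTwo2 = c(NA, 8), PanLowDownTwo3 = c(NA, 9)
)

# Hypothetical helper: split "PotHiUpOne" into its four levels at each
# capital letter instead of listing all 16 cases by hand
split_condition <- function(name) {
  parts <- regmatches(name, gregexpr("[A-Z][a-z]+", name))[[1]]
  if (length(parts) != 4) stop("Input not recognized: ", name)
  tolower(parts)
}

# Hypothetical stand-in for the row-wise step: one wide row in, one tidy row out
tidy_row <- function(row) {
  obs <- row[-(1:2)]                          # drop id and age
  keep <- !is.na(unlist(obs))                 # columns this participant filled in
  cond <- unique(sub("[0-9]+$", "", names(obs)[keep]))  # strip observation number
  stopifnot(length(cond) == 1)                # exactly one condition per row
  vars <- split_condition(cond)
  values <- as.list(unlist(obs[keep], use.names = FALSE))
  names(values) <- paste0("obs", seq_along(values))
  cbind(
    data.frame(id = row$id, age = row$age,
               Var1 = vars[1], Var2 = vars[2], Var3 = vars[3], Var4 = vars[4]),
    as.data.frame(values)
  )
}

# The loop itself: convert each row, then stack the results
rows <- vector("list", nrow(wide))
for (i in seq_len(nrow(wide))) {
  rows[[i]] <- tidy_row(wide[i, ])
}
tidy <- do.call(rbind, rows)
```

Returning a one-row data frame from each iteration, rather than an unnamed vector, keeps the column types and names intact when the rows are stacked.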