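Grouping identical values into the same column in R

Time: 2015-04-09 13:13:47

Tags: r

I am still a relative newcomer to R and need some help with the following problem. I have a data frame that looks similar to this (but is more complex):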

Token1  A    B   E
Token2  A    F   D   G  
Token3  C    F   E
Token4  B    A   F

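What I want is to group each unique value that occurs in a row so that every column holds only one value: the value itself if the row contains it, and NA otherwise, like this: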

Token1  A    B    NA   NA  E   NA  NA
Token2  A    NA   NA   D   NA  F   G  
Token3  NA   NA   C    NA  E   F   NA
Token4  A    B    NA   NA  NA  F   NA

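So far I have not found anything that helps... How can I get the result above?

Thanks in advance!

Edit:

Thanks everyone, but the actual DF is much more complex and contains thousands of possible values across all columns (I only used A, B, C, etc. to simplify the problem), so the solutions do not seem to work... How can I group all of them (I know there will be many single-value columns)?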

3 answers:

Answer 0 (score: 1)

You can try:

cols = unique(unlist(df[-1]))
# Drop NA and the empty-string placeholder so that neither becomes a column
cols = as.vector(sort(cols[!is.na(cols) & cols != ""]))

cbind(as.vector(df[,1]),
      t(apply(df[-1], 1, function(u) ifelse(cols %in% u[!is.na(u)], cols, NA))))
#  [,1]    [,2]     [,3]     [,4]      [,5]       [,6]      [,7]         [,8]      
#1 "cat"   NA       "dog.01" NA        NA         NA        NA           NA        
#2 "bird"  "cat.01" NA       NA        "robin.01" NA        NA           "eagle.01"
#3 "horse" NA       "dog.01" "pony.01" NA         NA        "unicorn.01" NA        
#4 "dog"   "cat.01" NA       NA        "robin.01" "bird.01" NA           NA        
#5 ""      NA       NA       NA        NA         NA        NA           NA        

Data:

df=structure(list(Lemma = structure(c(3L, 2L, 5L, 4L, 1L), .Label = c("", "bird", "cat", "dog", "horse"), class = "factor"), Sim = structure(c(3L, 5L, 4L, 2L, 1L), .Label = c("", "cat.01", "dog.01", "pony.01", "robin.01"), class = "factor"), X = structure(c(1L, 3L, 4L, 2L, 1L), .Label = c("", "bird.01", "cat.01", "unicorn.01"), class = "factor"), X.1 = structure(c(1L, 3L, 2L, 4L, 1L), .Label = c("", "dog.01", "eagle.01", "robin.01"), class = "factor")), .Names = c("Lemma", "Sim", "X", "X.1"), row.names = c(NA, 5L), class = "data.frame")
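As a quick check, the same idea can be applied to the small example from the question. This is only a sketch: the column names `V1` to `V4` and the use of NA for the missing cells are assumptions made for illustration, not part of the original data.

```r
# Example from the question; column names and NA padding are assumptions.
df <- data.frame(
  Token = c("Token1", "Token2", "Token3", "Token4"),
  V1 = c("A", "A", "C", "B"),
  V2 = c("B", "F", "F", "A"),
  V3 = c("E", "D", "E", "F"),
  V4 = c(NA, "G", NA, NA),
  stringsAsFactors = FALSE
)

# All distinct non-missing values, sorted; each one becomes an output column.
# sort() drops NA by default.
cols <- sort(unique(unlist(df[-1])))

# For each row, keep a value in its own column if the row contains it.
res <- t(apply(df[-1], 1, function(u) ifelse(cols %in% u, cols, NA)))
res <- cbind(df$Token, res)
res
```

The result is a character matrix with one row per token and one column per distinct value, in the layout asked for in the question.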

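Answer 1 (score: 1)

I am not sure why you would want such a structure, but after collapsing all the non-token columns into a single string, you can try cSplit_e from my "splitstackshape" package.

Here is an example using the sample data from Colonel Beauvel's answer.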

df2 <- cbind(df[1], New = do.call(paste, c(df[-1], sep = ",")))
library(splitstackshape)

cSplit_e(df2, "New", ",", mode = "value", type = "character", drop = TRUE)
#   Lemma New_ New_bird.01 New_cat.01 New_dog.01 New_eagle.01 New_pony.01
# 1   cat             <NA>       <NA>     dog.01         <NA>        <NA>
# 2  bird <NA>        <NA>     cat.01       <NA>     eagle.01        <NA>
# 3 horse <NA>        <NA>       <NA>     dog.01         <NA>     pony.01
# 4   dog <NA>     bird.01     cat.01       <NA>         <NA>        <NA>
# 5                   <NA>       <NA>       <NA>         <NA>        <NA>
#   New_robin.01 New_unicorn.01
# 1         <NA>           <NA>
# 2     robin.01           <NA>
# 3         <NA>     unicorn.01
# 4     robin.01           <NA>
# 5         <NA>           <NA>

You will have to drop the "New_" column, which appears because some of the cells are empty.
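Dropping that column can be done by name. The sketch below uses a small hand-built stand-in for the result rather than calling cSplit_e, and it assumes the result is a plain data.frame; the object name `res` is hypothetical.

```r
# Stand-in for part of the result above; only the column names matter here.
res <- data.frame(
  Lemma = c("cat", "bird"),
  New_ = c("", NA),
  New_cat.01 = c(NA, "cat.01"),
  New_dog.01 = c("dog.01", NA),
  stringsAsFactors = FALSE
)

# Drop the column that was created from the empty strings.
res <- res[, names(res) != "New_", drop = FALSE]

# Optionally strip the "New_" prefix from the remaining column names.
names(res) <- sub("^New_", "", names(res))
names(res)
```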

Answer 2 (score: 0)

Here is another solution that covers all letters:

df <- data.frame(Token=c("Token1", "Token2", "Token3", "Token4"),
       Col1=c("A", "A", "C", "B"), Col2=c("B", "F", "F", "A"), 
       Col3=c("E", "D", "E", "F"), Col4=c("", "G", "", ""))

df2 <- data.frame(df[, 1], t(sapply(seq_len(nrow(df)), function(i) {
  # TRUE where the letter occurs in row i
  LETTERS %in% c(t(df[i, -1]))
})))

names(df2) <- c("Token", LETTERS)

# Replace TRUE with the column's letter and FALSE with NA
for (i in 2:27) {
  df2[[i]] <- ifelse(df2[[i]], names(df2)[i], NA)
}

df2
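The version above is tied to the 26 letters. For the case described in the question's edit, where the set of values is large and not known in advance, one possible variation is to derive the columns from the data and fill the result with matrix indexing instead of looping over rows. This is a sketch under the assumption that empty strings mark missing cells; the names `m`, `vals`, and `out` are introduced here for illustration.

```r
df <- data.frame(
  Token = c("Token1", "Token2", "Token3", "Token4"),
  Col1 = c("A", "A", "C", "B"),
  Col2 = c("B", "F", "F", "A"),
  Col3 = c("E", "D", "E", "F"),
  Col4 = c("", "G", "", ""),
  stringsAsFactors = FALSE
)

m <- as.matrix(df[-1])               # character matrix of the value columns
m[m == ""] <- NA                     # treat empty cells as missing (assumption)
vals <- sort(unique(m[!is.na(m)]))   # one output column per distinct value

# Empty result: one row per token, one column per distinct value.
out <- matrix(NA_character_, nrow(m), length(vals),
              dimnames = list(df$Token, vals))

# Build (row, column) index pairs for every non-missing cell and
# write each value into the column that carries its name.
keep <- !is.na(m)
out[cbind(row(m)[keep], match(m[keep], vals))] <- m[keep]
out
```

Because the columns come from `vals` rather than a fixed alphabet, the same code should work unchanged when the values are arbitrary strings.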