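I'm still relatively new to R and need some help with the following problem. I have a data frame that looks similar to this (but is more complex):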
Token1 A B E
Token2 A F D G
Token3 C F E
Token4 B A F
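What I want is to group every unique value that occurs in the rows, so that each column holds only one value: the value itself if it occurs in that row, and NA if it does not, like this: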
Token1 A B NA NA E NA NA
Token2 A NA NA D NA F G
Token3 NA NA C NA E F NA
Token4 A B NA NA NA F NA
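So far I haven't found anything that helps... How can I get the result above?
Thanks in advance!
Edit:
Thanks everyone, but the actual DF is far more complex and contains thousands of possible values across all columns (I only used A, B, C, etc. to simplify the question), so the solutions so far don't seem to work... How can I group all of them (I know there will be many single-value columns)?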
Answer 0: (score: 1)
You can try:
cols = unique(unlist(df[-1]))
cols = as.vector(sort(cols[!is.na(cols)]))
cbind(as.vector(df[,1]),
t(apply(df[-1], 1, function(u) ifelse(cols %in% u[!is.na(u)], cols, NA))))
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#1 "cat" NA "dog.01" NA NA NA NA NA
#2 "bird" "cat.01" NA NA "robin.01" NA NA "eagle.01"
#3 "horse" NA "dog.01" "pony.01" NA NA "unicorn.01" NA
#4 "dog" "cat.01" NA NA "robin.01" "bird.01" NA NA
#5 "" NA NA NA NA NA NA NA
Data:
df=structure(list(Lemma = structure(c(3L, 2L, 5L, 4L, 1L), .Label = c("", "bird", "cat", "dog", "horse"), class = "factor"), Sim = structure(c(3L, 5L, 4L, 2L, 1L), .Label = c("", "cat.01", "dog.01", "pony.01", "robin.01"), class = "factor"), X = structure(c(1L, 3L, 4L, 2L, 1L), .Label = c("", "bird.01", "cat.01", "unicorn.01"), class = "factor"), X.1 = structure(c(1L, 3L, 2L, 4L, 1L), .Label = c("", "dog.01", "eagle.01", "robin.01"), class = "factor")), .Names = c("Lemma", "Sim", "X", "X.1"), row.names = c(NA, 5L), class = "data.frame")
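As a variation on the approach above, here is a minimal sketch that treats empty strings as missing and names the result columns after the values they hold. It assumes that "" means "no value" and rebuilds the same sample values as plain character columns; the names `vals`, `wide` and `res` are only illustrative and not part of the original answer.

```r
# Same sample values as the data above, stored as plain character columns.
df <- data.frame(
  Lemma = c("cat", "bird", "horse", "dog", ""),
  Sim   = c("dog.01", "robin.01", "pony.01", "cat.01", ""),
  X     = c("", "cat.01", "unicorn.01", "bird.01", ""),
  X.1   = c("", "eagle.01", "dog.01", "robin.01", ""),
  stringsAsFactors = FALSE
)

vals <- as.matrix(df[-1])                  # character matrix of the value columns
vals[vals == ""] <- NA                     # assumption: "" means "no value"
cols <- sort(unique(vals[!is.na(vals)]))   # one output column per distinct value

# For each row, keep a value where it occurs and put NA everywhere else.
wide <- t(apply(vals, 1, function(u) ifelse(cols %in% u, cols, NA)))
colnames(wide) <- cols

res <- data.frame(Lemma = df$Lemma, wide,
                  check.names = FALSE, stringsAsFactors = FALSE)
res
```

Because `cols` is taken from the data itself, the same code should carry over to thousands of distinct values, at the cost of one column per value.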
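Answer 1: (score: 1)
I'm not sure why you would want such a structure, but after collapsing all the non-token columns into a single string, you can try cSplit_e from my "splitstackshape" package.
Here is an example using the sample data from Colonel Beauvel's answer.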
df2 <- cbind(df[1], New = do.call(paste, c(df[-1], sep = ",")))
library(splitstackshape)
cSplit_e(df2, "New", ",", mode = "value", type = "character", drop = TRUE)
# Lemma New_ New_bird.01 New_cat.01 New_dog.01 New_eagle.01 New_pony.01
# 1 cat <NA> <NA> dog.01 <NA> <NA>
# 2 bird <NA> <NA> cat.01 <NA> eagle.01 <NA>
# 3 horse <NA> <NA> <NA> dog.01 <NA> pony.01
# 4 dog <NA> bird.01 cat.01 <NA> <NA> <NA>
# 5 <NA> <NA> <NA> <NA> <NA>
# New_robin.01 New_unicorn.01
# 1 <NA> <NA>
# 2 robin.01 <NA>
# 3 <NA> unicorn.01
# 4 robin.01 <NA>
# 5 <NA> <NA>
You would have to drop the "New_" column, which comes from the empty values in some of the columns.
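Dropping that column can be sketched in base R as follows. The data frame `out` is a hypothetical, hand-built stand-in with only a few of the columns shown above, not real cSplit_e output, and stripping the "New_" prefix is an optional extra that the answer does not ask for.

```r
# Hypothetical stand-in for the cSplit_e() result (only a few columns shown).
out <- data.frame(
  Lemma      = c("cat", "bird"),
  New_       = c(NA_character_, NA_character_),
  New_cat.01 = c(NA, "cat.01"),
  New_dog.01 = c("dog.01", NA),
  stringsAsFactors = FALSE
)

# Drop the column that was created for the empty string.
out <- out[, names(out) != "New_", drop = FALSE]

# Optional: strip the "New_" prefix so columns are named after their values.
names(out) <- sub("^New_", "", names(out))
out
```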
Answer 2: (score: 0)
Here is another solution that covers all letters:
df <- data.frame(Token=c("Token1", "Token2", "Token3", "Token4"),
Col1=c("A", "A", "C", "B"), Col2=c("B", "F", "F", "A"),
Col3=c("E", "D", "E", "F"), Col4=c("", "G", "", ""))
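# One row per token, one logical column per letter: TRUE if the letter occurs in that row.
df2 <- data.frame(df[, 1], t(sapply(1:dim(df)[1], function(i){
  toupper(letters) %in% c(t(df[i, -1]))
})))
names(df2) <- c("Token", toupper(letters))
# Replace TRUE with the column's own letter and FALSE with NA.
for(i in 2:27){
  df2[, i][df2[, i]==TRUE] <- names(df2)[i]
  df2[, i][df2[, i]==FALSE] <- NA
}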
df2
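Since the real data reportedly holds thousands of values rather than single letters, the same idea can be sketched without hard-coding `toupper(letters)`, by taking the column set from the data. This is a minimal sketch under the assumption that "" marks an empty cell; `vals`, `lev`, `present` and `out` are illustrative names.

```r
df <- data.frame(Token = c("Token1", "Token2", "Token3", "Token4"),
                 Col1 = c("A", "A", "C", "B"), Col2 = c("B", "F", "F", "A"),
                 Col3 = c("E", "D", "E", "F"), Col4 = c("", "G", "", ""),
                 stringsAsFactors = FALSE)

vals <- as.matrix(df[-1])
lev  <- sort(unique(vals[vals != ""]))   # distinct values, ignoring empty cells

# present[i, j] is TRUE when value lev[j] occurs somewhere in row i.
present <- t(apply(vals, 1, function(u) lev %in% u))

# Fill every column with its own label, then blank out the cells where it is absent.
out <- matrix(lev, nrow = nrow(df), ncol = length(lev), byrow = TRUE,
              dimnames = list(NULL, lev))
out[!present] <- NA

df2 <- data.frame(Token = df$Token, out, stringsAsFactors = FALSE)
df2
```

Building the logical matrix first and masking afterwards avoids the per-column loop, and no unused columns are created for values that never occur.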