I have an input data frame like this:
COL1 COL2
10 res prt
10 res
10 kitty
10 dog
10 kitty cat
10 doggy dog
I want the output to look like this, i.e. COL2 should contain the joined values with duplicates removed:
COL1 COL2
10 res prt, kitty, dog, cat, doggy
Could someone please help me with this? I am new to R.
Answer 0 (score: 1)
If 10 is the only value in COL1, then:
> new.df <- data.frame(COL1 = 10, COL2 = paste(unique(unlist(strsplit(paste(df$COL2), split = " "))), collapse = " "))
Result:
> new.df
COL1 COL2
1 10 res prt kitty dog cat doggy
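The one-liner above hard-codes COL1 = 10. If COL1 can hold several values, a base R sketch along the following lines might work. The sample data (including the extra rows with COL1 = 20) and the helper name `unique.words` are made up for illustration and are not part of the question.

```r
# Hypothetical input: the question's rows plus two invented rows for COL1 = 20
df <- data.frame(COL1 = c(rep(10, 6), rep(20, 2)),
                 COL2 = c("res prt", "res", "kitty", "dog", "kitty cat",
                          "doggy dog", "res kitty", "kitty cat"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: split into words, drop duplicates, join with spaces
unique.words <- function(x) {
  paste(unique(unlist(strsplit(x, split = " "))), collapse = " ")
}

# Apply the helper to COL2 separately for each value of COL1
out <- aggregate(COL2 ~ COL1, data = df, FUN = unique.words)
out
```

Like the one-liner, this treats every word separately, so "res prt" is not kept together as a single entry.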
Edit
To get the exact answer, try this dumb brute-force solution (dumb because every for loop in R is considered wrong, I think):
> str <- paste(df$COL2)
> str
[1] "res prt" "res" "kitty" "dog" "kitty cat" "doggy dog"
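> for(i in 2:length(str)) {
    # words from all earlier entries, str[1] to str[i - 1]
    Remaining.Words <- unlist(strsplit(str[1:(i - 1)], split = " "))
    # words of the current entry
    My.Words <- unlist(strsplit(str[i], split = " "))
    for(k in 1:length(My.Words)) {
      # drop a word that already appeared in an earlier entry
      if(My.Words[k] %in% Remaining.Words) My.Words <- My.Words[-k]
    }
    # keep whatever is left, or delete the entry if nothing is left
    if(length(My.Words) > 0) str[i] <- paste(My.Words, collapse = " ")
    # note: deleting shortens str while the loop still runs to the original
    # length, which is where the trailing "NA" below comes from
    else str <- str[-i]
  }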
> str
[1] "res prt" "kitty" "dog" "cat" "doggy" "NA"
> new.df <- data.frame(COL1 = 10, COL2 = paste(str[-6], collapse = ","))
Result_2.0:
> new.df
COL1 COL2
1 10 res prt,kitty,dog,cat,doggy
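As a possible alternative to the loop above, here is a minimal sketch that builds a new vector instead of deleting elements while iterating, so no "NA" entry appears and no `str[-6]` cleanup is needed. The variable names `seen`, `kept` and `fresh` are made up for illustration, and the input is assumed to be the data frame from the question.

```r
# Assumed input, copied from the question
df <- data.frame(COL1 = 10,
                 COL2 = c("res prt", "res", "kitty", "dog",
                          "kitty cat", "doggy dog"),
                 stringsAsFactors = FALSE)

seen <- character(0)  # every word encountered so far
kept <- character(0)  # entries that still contain at least one new word

for (entry in df$COL2) {
  words <- unique(unlist(strsplit(entry, split = " ")))
  fresh <- words[!(words %in% seen)]   # words not seen in earlier entries
  if (length(fresh) > 0) kept <- c(kept, paste(fresh, collapse = " "))
  seen <- c(seen, words)
}

new.df <- data.frame(COL1 = 10, COL2 = paste(kept, collapse = ", "),
                     stringsAsFactors = FALSE)
new.df
```

Because only previously seen words are removed, the first entry "res prt" stays together, which matches the output requested in the question.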
Answer 1 (score: 0)
Here is a simple example:
# a text column
txt <- c("foo bar", "bar", "foo")
# split it into words
words <- unlist(strsplit(txt, " "))
# return the unique values of this
unique(words)
[1] "foo" "bar"
Does that make sense? If you want to print them joined together, you can use:
cat(unique(words))
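Note that `cat()` only prints to the console and returns NULL. If the joined text is needed as a value, for example to store it in a data frame column, a sketch using `paste()` with `collapse` could look like this (the name `joined` is just an illustration):

```r
txt <- c("foo bar", "bar", "foo")
words <- unlist(strsplit(txt, " "))

# Build a single comma-separated string from the unique words
joined <- paste(unique(words), collapse = ", ")
joined
```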
Answer 2 (score: 0)
You can use dplyr. Try:
df <- data.frame(COL1 = c(rep(10, 4), rep(20, 3)),
COL2 = c("res prt", "res", "kitty", "kitty cat",
"res", "kitty", "kitty cat"),
stringsAsFactors = FALSE)
## COL1 COL2
## 1 10 res prt
## 2 10 res
## 3 10 kitty
## 4 10 kitty cat
## 5 20 res
## 6 20 kitty
## 7 20 kitty cat
library(dplyr)
makeString <- function(x) {
res <- unlist(strsplit(x, " "))
res <- unique(res)
paste(res, collapse = ", ")
}
df %>% group_by(COL1) %>% summarise_all(makeString)
This gives you:
## A tibble: 2 × 2
## COL1 COL2
## <dbl> <chr>
## 1 10 res, prt, kitty, cat
## 2 20 res, kitty, cat