Question

I have an input data frame like this:

COL1    COL2
10     res prt
10     res
10     kitty
10     dog 
10     kitty cat
10     doggy dog

I want the output to look like this, i.e. COL2 should contain the concatenated values with duplicates removed:

COL1    COL2
10  res prt, kitty, dog, cat, doggy

Could someone please help me with this? I'm new to R.

Answer 1

If 10 is the only entry in COL1, then:

> new.df <- data.frame(COL1 = 10, COL2 = paste(unique(unlist(strsplit(paste(df$COL2), split = " "))), collapse = " "))

Result:

> new.df
  COL1                        COL2
1   10 res prt kitty dog cat doggy
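
If COL1 held more than one distinct value, the same split / unique / paste idea could be applied per group. A minimal base R sketch, assuming a data frame shaped like the one in the question; the rows with COL1 == 20 and the helper name collapse_unique are made up here for illustration:

```r
# Example data; the rows with COL1 == 20 are invented for illustration
df <- data.frame(COL1 = c(10, 10, 10, 20, 20),
                 COL2 = c("res prt", "res", "kitty", "dog", "doggy dog"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: split strings into words, keep the first
# occurrence of each word, and join them back with spaces
collapse_unique <- function(x) {
  paste(unique(unlist(strsplit(x, " "))), collapse = " ")
}

# aggregate() applies the helper to COL2 within each COL1 group
new.df <- aggregate(df["COL2"], by = df["COL1"], FUN = collapse_unique)
new.df
```

Each row of new.df then holds one COL1 value together with the de-duplicated words of that group.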

Edit

To get the exact answer, try this dumb brute-force solution (dumb because every for loop in R is considered wrong, I think):

> str <- paste(df$COL2)
> str
[1] "res prt"   "res"       "kitty"     "dog"       "kitty cat" "doggy dog"
> for(i in 2:length(str)) {
    Remaining.Words <- unlist(strsplit(str[1:i-1], split = " "))
    My.Words <- unlist(strsplit(str[i], split = " "))
    for(k in 1:length(My.Words)) {
      if(My.Words[k] %in% Remaining.Words) My.Words <- My.Words[-k]
    }
    if(length(My.Words) > 0) str[i] <- paste(My.Words, collapse = " ")
    else str <- str[-i]
  }
> str
[1] "res prt" "kitty"   "dog"     "cat"     "doggy"   "NA"
> new.df <- data.frame(COL1 = 10, COL2 = paste(str[-6], collapse = ","))

Result_2.0：

> new.df
  COL1                        COL2
1   10 res prt,kitty,dog,cat,doggy
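
The same "drop words already seen in earlier strings" idea can be written without removing elements while looping over them, which avoids the trailing "NA" entry. A minimal sketch, not the answerer's code; the names strs, seen, kept and fresh are illustrative, and the input is copied from the question:

```r
# Input strings taken from COL2 in the question
strs <- c("res prt", "res", "kitty", "dog", "kitty cat", "doggy dog")

seen <- character(0)  # words encountered in earlier strings
kept <- character(0)  # strings that still contain unseen words

for (s in strs) {
  words <- strsplit(s, " ", fixed = TRUE)[[1]]
  fresh <- setdiff(words, seen)  # words not seen before, original order
  if (length(fresh) > 0) {
    kept <- c(kept, paste(fresh, collapse = " "))
  }
  seen <- union(seen, words)
}

result <- paste(kept, collapse = ", ")
result
# [1] "res prt, kitty, dog, cat, doggy"
```

Strings whose words have all appeared before (such as the second "res") are skipped entirely, so nothing needs to be dropped afterwards.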

Answer 2

Here is a simple example:

# a text column
txt <- c("foo bar", "bar", "foo")

# split it into words
words <- unlist(strsplit(txt, " "))

# return the unique values of this
unique(words)
[1] "foo" "bar"

Does that make sense? If you want to print them out joined together, you can say:

cat(unique(words))
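
Note that cat() only prints to the console and returns NULL, so its output cannot be stored in a data frame column. If the joined string is needed as a value, paste() with collapse is one option. A small sketch using the same example vector; the variable name joined is illustrative:

```r
txt <- c("foo bar", "bar", "foo")
words <- unlist(strsplit(txt, " "))

# paste() with collapse returns a single string that can be assigned
joined <- paste(unique(words), collapse = ", ")
joined
# [1] "foo, bar"
```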

Answer 3

You can use dplyr. Try:

df <- data.frame(COL1 = c(rep(10, 4), rep(20, 3)),
                 COL2 = c("res prt", "res", "kitty", "kitty cat",
                          "res", "kitty", "kitty cat"),
                 stringsAsFactors = FALSE)
##  COL1      COL2
## 1   10   res prt
## 2   10       res
## 3   10     kitty
## 4   10 kitty cat
## 5   20       res
## 6   20     kitty
## 7   20 kitty cat

library(dplyr)
makeString <- function(x) {
  res <- unlist(strsplit(x, " "))
  res <- unique(res)
  paste(res, collapse = ", ")
}

df %>%  group_by(COL1) %>% summarise_all(makeString)

This gives you:

## A tibble: 2 × 2
##   COL1                 COL2
##  <dbl>                <chr>
## 1    10 res, prt, kitty, cat
## 2    20      res, kitty, cat
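
Recent dplyr releases document the scoped verbs such as summarise_all() as superseded. A sketch of the same grouping with plain summarise(), assuming dplyr is installed and using the same example data and helper as above:

```r
library(dplyr)

df <- data.frame(COL1 = c(rep(10, 4), rep(20, 3)),
                 COL2 = c("res prt", "res", "kitty", "kitty cat",
                          "res", "kitty", "kitty cat"),
                 stringsAsFactors = FALSE)

# Split into words, drop repeats, join with ", "
makeString <- function(x) {
  paste(unique(unlist(strsplit(x, " "))), collapse = ", ")
}

# Name the summarised column explicitly instead of using summarise_all()
out <- df %>%
  group_by(COL1) %>%
  summarise(COL2 = makeString(COL2))
out
```

Because every string is split into single words first, "res prt" comes out as "res, prt" rather than as one item, which differs slightly from the output requested in the question.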

R - Merging column values
