Question

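I have a data frame called df. I want to split the strings in columns a, b and c on the comma and, for each column, get a column of the unique elements plus a count of those unique elements, as shown in the result below. How can this be done in R? Thanks for your help.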

a <- c("cat, cat, dog", "dog")
b<- c("cat")
c<- c("dog, dog", "cat")

df <- data.frame(position= c("1","2"),a, b, c, stringsAsFactors = F)

The result I want:

position    a_uniq  b_uniq  c_uniq  a_uniq_counts   b_uniq_counts   c_uniq_counts
1           cat,dog cat     dog     2               1               1
2           dog     cat     cat     1               1               1

Answer 1

Here is a solution using data.table:

library(data.table)

# Count the distinct comma-separated items in a single string
unique_counts <- function(str){
  return(uniqueN(unlist(strsplit(gsub(" ", "", str), ","))))
}

# Collapse the distinct comma-separated items back into one string
unique_strings <- function(str){
  return(paste0(unique(unlist(strsplit(gsub(" ", "", str), ","))), collapse = ","))
}

a <- c("cat, cat, dog", "dog")
b <- c("cat")
c <- c("dog, dog", "cat")

df <- data.frame(position = c("1", "2"), a, b, c, stringsAsFactors = FALSE)
df <- as.data.table(df)

# Process every column except position, then drop the original column
for (i in colnames(df)[2:length(colnames(df))]){
  df[ , (paste0(i, "_uniq")) := mapply(unique_strings, get(i))]
  df[ , (paste0(i, "_uniq_counts")) := mapply(unique_counts, get(i))]
  df[ , (i) := NULL]
}
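To make the two helpers concrete, here is a minimal base R sketch that traces the same steps on a single cell. The variable names are illustrative only, and length(unique(...)) stands in for data.table's uniqueN.

```r
# Illustrative only: trace the split / de-duplicate steps on one cell
cell <- "cat, cat, dog"

no_spaces  <- gsub(" ", "", cell)             # strip the spaces after each comma
items      <- unlist(strsplit(no_spaces, ",")) # split on the comma
uniq_items <- unique(items)                    # drop repeated items

uniq_string <- paste0(uniq_items, collapse = ",")  # value for the *_uniq column
uniq_count  <- length(uniq_items)                  # value for the *_uniq_counts column

print(uniq_string)
print(uniq_count)
```

Because strsplit works on one string at a time here, the helpers have to be applied cell by cell, which is why the loop above wraps them in mapply.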

Best!

Answer 2

Here is one option with the tidyverse. Use mutate_at, split each string at the delimiter ", " and get the unique count with n_distinct.

library(tidyverse)
# funs() is deprecated as of dplyr 0.8.0; a named list of lambdas does the same job
df %>% 
      mutate_at(vars(a:c), list(uniq_counts = ~ map_int(strsplit(.x, ", "),
                  n_distinct)))
#  position             a   b        c a_uniq_counts b_uniq_counts c_uniq_counts
#1        1 cat, cat, dog cat dog, dog             2             1             1
#2        2           dog cat      cat             1             1             1
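The snippet above adds only the count columns and keeps the original a, b and c. As a hedged sketch of how the unique values could be produced as well, the following uses dplyr's across() with two small helper functions. The helper names split_items, uniq_string and uniq_count are made up for this illustration, and the data frame is rebuilt so the example runs on its own.

```r
library(dplyr)

df <- data.frame(position = c("1", "2"),
                 a = c("cat, cat, dog", "dog"),
                 b = "cat",
                 c = c("dog, dog", "cat"),
                 stringsAsFactors = FALSE)

# Hypothetical helpers: split each cell on a comma plus optional whitespace
split_items <- function(x) strsplit(x, ",\\s*")

# One string of unique items per cell
uniq_string <- function(x) {
  vapply(split_items(x), function(v) paste(unique(v), collapse = ","), character(1))
}

# Number of unique items per cell
uniq_count <- function(x) {
  vapply(split_items(x), function(v) length(unique(v)), integer(1))
}

result <- df %>%
  # The default naming scheme "{.col}_{.fn}" yields a_uniq, a_uniq_counts, ...
  mutate(across(a:c, list(uniq = uniq_string, uniq_counts = uniq_count))) %>%
  # Keep only the derived columns, in the order requested in the question
  select(position, ends_with("_uniq"), ends_with("_uniq_counts"))

print(result)
```

The final select() both drops the original columns and puts the *_uniq columns ahead of the *_uniq_counts columns, matching the layout requested in the question.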

How to get unique values and counts after splitting strings in each cell of a table
