Counting unique values across columns in R

Time: 2019-04-12 10:05:36

Tags: r dataframe

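I am trying to create a new variable that holds the count of unique string values drawn from two different columns. For example, I have something like this: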

# A tibble: 4 x 2
  names   partners                 
  <fct>   <fct>                    
1 John    Mary, Ashley, John, Kate 
2 Mary    Charlie, John, Mary, John
3 Charlie Kate, Marcy              
4 David   Mary, Claire 
structure(list(names = structure(c(3L, 4L, 1L, 2L), .Label = c("Charlie", 
"David", "John", "Mary"), class = "factor"), partners = structure(c(3L, 
1L, 2L, 4L), .Label = c("Charlie, John, Mary, John", "Kate, Marcy", 
"Mary, Ashley, John, Kate", "Mary, Claire"), class = "factor")), row.names = c(NA, 
4L), class = "data.frame")

I would like to get something like this:

# A tibble: 4 x 3
  names   partners                  uniquecounts
  <fct>   <fct>                            <dbl>
1 John    Mary, Ashley, John, Kate             4
2 Mary    Charlie, John, Mary, John            3
3 Charlie Kate, Marcy                          3
4 David   Mary, Claire                         3

I tried merging the two columns into one and then counting the unique values in it, but that did not work.

3 answers:

Answer 0: (score: 2)

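Using tidyverse, first convert the factor columns to character, then use map2 to split partners into a vector of individual strings, combine it with names, and count the unique values with n_distinct.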
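A minimal sketch of these steps, assuming the question's data is stored in a data frame called `df`. That name and the explicit per-column `as.character` conversion are assumptions for illustration, not necessarily the answerer's exact code:

```r
library(dplyr)
library(purrr)
library(stringr)

# Example data from the question, rebuilt with factor columns
df <- data.frame(
  names = c("John", "Mary", "Charlie", "David"),
  partners = c("Mary, Ashley, John, Kate",
               "Charlie, John, Mary, John",
               "Kate, Marcy",
               "Mary, Claire"),
  stringsAsFactors = TRUE
)

result <- df %>%
  # convert the factor columns to character
  mutate(names = as.character(names),
         partners = as.character(partners)) %>%
  # for each row: split partners on ", ", prepend the name,
  # and count the distinct strings
  mutate(uniquecounts = map2_dbl(
    names, partners,
    ~ n_distinct(c(.x, str_split(.y, ", ")[[1]]))
  ))

result
```

`map2_dbl` is used so that `uniquecounts` comes back as a double column, matching the `<dbl>` type shown in the desired output.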

The same logic works in base R.
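A base R sketch of the same idea, again assuming a data frame named `df`. Here `mapply`, `strsplit`, and `length(unique(...))` stand in for `map2`, the string split, and `n_distinct`; these are illustrative choices, not a confirmed copy of the original answer:

```r
# Example data from the question, rebuilt with factor columns
df <- data.frame(
  names = c("John", "Mary", "Charlie", "David"),
  partners = c("Mary, Ashley, John, Kate",
               "Charlie, John, Mary, John",
               "Kate, Marcy",
               "Mary, Claire"),
  stringsAsFactors = TRUE
)

# convert every column to character while keeping the data frame structure
df[] <- lapply(df, as.character)

# walk over names and partners in parallel, split partners on ", ",
# add the name, and count the unique strings
df$uniquecounts <- mapply(
  function(x, y) length(unique(c(x, strsplit(y, ", ")[[1]]))),
  df$names, df$partners,
  USE.NAMES = FALSE
)

df
```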

Answer 1: (score: 0)

Here is another way, using toString.

dat$uniquecounts <- sapply(strsplit(apply(dat, 1, toString), ", "), 
                           function(x) length(unique(x)))

dat
#     names                  partners uniquecounts
# 1    John  Mary, Ashley, John, Kate            4
# 2    Mary Charlie, John, Mary, John            3
# 3 Charlie               Kate, Marcy            3
# 4   David              Mary, Claire            3
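To see why this works, the one-liner can be unpacked into its intermediate steps. The sketch below rebuilds `dat` from the question's data; the intermediate names `joined`, `pieces`, and `counts` are hypothetical and used only for illustration:

```r
# Example data from the question, rebuilt with factor columns
dat <- data.frame(
  names = c("John", "Mary", "Charlie", "David"),
  partners = c("Mary, Ashley, John, Kate",
               "Charlie, John, Mary, John",
               "Kate, Marcy",
               "Mary, Claire"),
  stringsAsFactors = TRUE
)

# Step 1: collapse each row into one comma-separated string,
# so the name sits alongside the partners
joined <- apply(dat, 1, toString)

# Step 2: split each string back into individual names
pieces <- strsplit(joined, ", ")

# Step 3: count the unique names per row
counts <- sapply(pieces, function(x) length(unique(x)))

counts
```

Because `toString` joins with ", ", the same separator that `partners` already uses, a single `strsplit` call handles both columns at once.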

Answer 2: (score: 0)

Here is an approach using tidyverse without looping:

library(tidyverse)
df1 %>% 
   mutate(partners = str_c(names, partners, sep=", ")) %>%
   separate_rows(partners) %>%
   distinct %>% 
   count(names) %>% 
   right_join(df1)
# A tibble: 4 x 3
#  names       n partners                 
#  <fct>   <int> <fct>                    
#1 John        4 Mary, Ashley, John, Kate 
#2 Mary        3 Charlie, John, Mary, John
#3 Charlie     3 Kate, Marcy              
#4 David       3 Mary, Claire