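I am trying to summarize my data to find correlations/patterns, and I want to discover how and where the data might be related. Specifically, I want to determine how many times IDs (called "item" here) occur together. Is there a way to find how many times each ID occurs together with each other ID?
This is for a larger data.frame that has already been cleaned and aggregated for this specific query. I have tried applying several aggregate, sum, and filter functions from packages such as "data.table", "dplyr", and "tidyverse", but could not get quite what I need.
In section 3 (Show some code), I provide a minimal reproducible example: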
set.seed(1234)
random.people<-c("Bob","Tim","Jackie","Angie","Christopher")
number=sample(12345:12350,2000,replace = T)
item=sample(random.people,2000,replace=T)
sample_data <- data.frame(cbind(number,item), stringsAsFactors = FALSE)
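Using the example here, I want to combine the names by number and output every combination of names together with a count n (value) of the numbers they share, expecting a result similar to: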
Pair value
Bob, Tim 2
Bob, Jackie 4
Bob, Angie 0
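This output (the one I am hoping for) would tell me that, across the whole df, Bob and Tim occur together 2 times and Bob and Jackie occur together 4 times, where "together" means both have the same number.
But the actual output is: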
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 2000 rows:
* 9, 23, 37, 164, 170, 180, 211...
Update: I came up with a ..creative(?) solution, but I hope someone can help speed it up. I can find all the numbers (column 1) shared between two names with the following:
x1<-sample_data %>% dplyr::filter(item=="Bob")
x2<-sample_data %>% dplyr::filter(item=="Tim")
Bob<-x1[,1]
Tim<-x2[,1]
Reduce(intersect, list(Bob,Tim))
Output:
[1] "12345" "12348" "12350" "12346" "12349" "12347"
As I said, this is very time-consuming: it requires creating far too many vectors (one per name) and then intersecting them for every combination.
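The manual Bob/Tim intersection above could be generalised to every pair of names without creating one vector per name by hand. The following is only a minimal sketch under the question's setup: `numbers_by_item`, `pairs`, and `pair_counts` are illustrative names that do not appear in the original, and the data is rebuilt with `data.frame(number, item)` so the block runs on its own.

```r
# Rebuild the question's example data so this block is self-contained.
set.seed(1234)
random.people <- c("Bob", "Tim", "Jackie", "Angie", "Christopher")
number <- sample(12345:12350, 2000, replace = TRUE)
item <- sample(random.people, 2000, replace = TRUE)
sample_data <- data.frame(number, item, stringsAsFactors = FALSE)

# One vector of distinct numbers per name; split() replaces the x1, x2, ... vectors.
numbers_by_item <- lapply(split(sample_data$number, sample_data$item), unique)

# All unordered pairs of names; combn() returns a 2-row character matrix.
pairs <- combn(sort(names(numbers_by_item)), 2)

# For each pair, count the numbers that both names share.
pair_counts <- data.frame(
  Pair = paste(pairs[1, ], pairs[2, ], sep = ", "),
  value = apply(pairs, 2, function(p) {
    length(intersect(numbers_by_item[[p[1]]], numbers_by_item[[p[2]]]))
  }),
  stringsAsFactors = FALSE
)
pair_counts
```

With only six possible numbers spread over 2000 rows, almost every name is likely to share every number with every other name, so the counts here are not very informative; a wider number range gives more varied results.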
Answer 0 (score: 3)
set.seed(1234)
random.people<-c("Bob","Tim","Jackie","Angie","Christopher")
number=sample(12345:22350,2000,replace = T) # number range widened from the question's 12345:12350
item=sample(random.people,2000,replace=T)
sample_data <- data.frame(cbind(number,item), stringsAsFactors = FALSE)
library(tidyverse)
sample_data %>%
# find out unique rows
distinct() %>%
# nest the data frame into nested tibble, so now you have
# a "data" column, which is a list of small data frames.
group_nest(number) %>%
# Here we use purrr::map to modify the list column. We want each
# combination counts only once despite the order, so we use sort.
mutate(data = map_chr(data, ~paste(sort(.x$item), collapse = ", "))) %>%
# the last two steps just count the numbers
group_by(data) %>%
count()
# A tibble: 21 x 2
# Groups:   data [21]
   data                         n
   <chr>                    <int>
 1 Angie                      336
 2 Angie, Bob                   8
 3 Angie, Bob, Christopher      2
 4 Angie, Bob, Jackie           1
 5 Angie, Christopher          16
 6 Angie, Jackie                9
 7 Angie, Tim                  10
 8 Bob                        331
 9 Bob, Christopher            12
10 Bob, Christopher, Jackie     1
# … with 11 more rows
This is one possible solution.
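The output above counts each exact set of names that shares a number. If strictly pairwise counts in the question's "Pair value" layout are wanted instead, one option is a co-occurrence matrix. This is a different technique from the answer's, sketched here with base R's `table` and `crossprod`; `incidence`, `co_occurrence`, and `pair_counts` are illustrative names. Note that a pair is counted for every number both names share, even when a third name shares it too, so these values can be larger than the exact-set counts.

```r
# Rebuild the answer's example data so this block is self-contained.
set.seed(1234)
random.people <- c("Bob", "Tim", "Jackie", "Angie", "Christopher")
number <- sample(12345:22350, 2000, replace = TRUE)
item <- sample(random.people, 2000, replace = TRUE)
sample_data <- data.frame(number, item, stringsAsFactors = FALSE)

# 0/1 incidence matrix: one row per number, one column per name.
# unique() drops repeated (number, item) rows so every cell is 0 or 1.
incidence <- unclass(table(unique(sample_data[, c("number", "item")])))

# crossprod(X) is t(X) %*% X: entry [i, j] counts the numbers that
# contain both name i and name j.
co_occurrence <- crossprod(incidence)

# Keep each unordered pair once: upper triangle, diagonal excluded.
idx <- which(upper.tri(co_occurrence), arr.ind = TRUE)
pair_counts <- data.frame(
  Pair = paste(rownames(co_occurrence)[idx[, 1]],
               colnames(co_occurrence)[idx[, 2]], sep = ", "),
  value = co_occurrence[idx],
  stringsAsFactors = FALSE
)
pair_counts[order(pair_counts$Pair), ]
```

Pairs that never share a number stay in the result with a value of 0, which matches the "Bob, Angie 0" row in the question's expected output.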
Answer 1 (score: 0)
Here is a base R solution that relies on table -> aggregate, plus what is probably an inefficient use of apply to paste the names together.
tab_data <- data.frame(unclass(table(unique(sample_data))))
#table results in columns c(Angie.1, Bob.1, ...) - this makes it look better
names(tab_data) = sort(random.people)
tab_data$n <- 1
agg_data <- aggregate(n~., data = tab_data, FUN = length)
agg_data$Pair <- apply(agg_data[, -length(agg_data)], 1, function(x) paste(names(x[x!=0]), collapse = ', '))
agg_data[order(agg_data$Pair), c('Pair', 'n') ]
                      Pair   n
1                    Angie 336
3               Angie, Bob   8
7  Angie, Bob, Christopher   2
11      Angie, Bob, Jackie   1
5       Angie, Christopher  16
9            Angie, Jackie   9
15              Angie, Tim  10
2                      Bob 331
6         Bob, Christopher  12
... truncated ...
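In terms of performance, on this relatively small dataset it is roughly 8 to 9 times faster than the dplyr solution (median 9.9 ms versus 84.0 ms):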
Unit: milliseconds
           expr     min       lq     mean   median       uq      max neval
  base_solution  9.4795  9.65215 10.80984  9.87625 10.32125  46.8230   100
 dplyr_solution 78.6070 81.72155 86.47891 83.96435 86.40495 200.7784   100
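The answer does not show how these timings were produced; the table layout appears to be that of the microbenchmark package. As a rough, base-only way to get a comparable figure, the base R code can be wrapped in a function and timed with `system.time`. This is a sketch under assumptions: `base_solution` borrows its name from the `expr` column above, `n_runs` and `elapsed` are illustrative, the data omits the extra `n` column, and `as.data.frame.matrix` is used in place of `data.frame(unclass(...))` so the column names need no fixing.

```r
# Rebuild the example data (number and item only) so this block is self-contained.
set.seed(1234)
random.people <- c("Bob", "Tim", "Jackie", "Angie", "Christopher")
number <- sample(12345:22350, 2000, replace = TRUE)
item <- sample(random.people, 2000, replace = TRUE)
sample_data <- data.frame(number, item, stringsAsFactors = FALSE)

# The table -> aggregate -> apply steps from the answer, wrapped for timing.
base_solution <- function(df) {
  # One row per number, one 0/1 column per name.
  tab_data <- as.data.frame.matrix(table(unique(df)))
  tab_data$n <- 1
  # Count how many numbers show each distinct 0/1 pattern of names.
  agg_data <- aggregate(n ~ ., data = tab_data, FUN = length)
  # Paste together the names whose column is non-zero in each pattern.
  agg_data$Pair <- apply(agg_data[, names(agg_data) != "n"], 1,
                         function(x) paste(names(x)[x != 0], collapse = ", "))
  agg_data[order(agg_data$Pair), c("Pair", "n")]
}

result <- base_solution(sample_data)

# Average wall-clock time per run, in milliseconds.
n_runs <- 20
elapsed <- system.time(for (i in seq_len(n_runs)) base_solution(sample_data))[["elapsed"]]
elapsed / n_runs * 1000
```

Timings from `system.time` are coarser than a dedicated benchmarking package and will vary by machine, so they should only be compared against other solutions measured the same way.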
Data
set.seed(1234)
random.people<-c("Bob","Tim","Jackie","Angie","Christopher")
number=sample(12345:22350,2000,replace = T) # number range widened from the question's 12345:12350
item=sample(random.people,2000,replace=T)
sample_data <- data.frame(number,item, n = 1L, stringsAsFactors = FALSE)