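I want to keep a column ("person" in the data frame) after performing a join in data.table. I can get close to the desired output, but because my experience with data.table is limited, I have to switch between data.table and dplyr:
Here is the data frame: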
df<-structure(list(person = c("p1", "p1", "p1", "p1", "p1", "p1",
"p1", "p2", "p2", "p2", "p3", "p3", "p3", "p4", "p4", "p4", "p5",
"p5", "p5", "p6", "p6", "p6", "p7", "p7", "p7"), hp_char = c("hp1",
"hp2", "hp3", "hp4", "hp5", "hp6", "hp7", "hp8", "hp9", "hp10",
"hp1", "hp2", "hp3", "hp5", "hp6", "hp7", "hp8", "hp9", "hp10",
"hp3", "hp4", "hp5", "hp1", "hp2", "hp3")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -25L), .Names = c("person",
"hp_char"), spec = structure(list(cols = structure(list(person = structure(list(), class = c("collector_character",
"collector")), hp_char = structure(list(), class = c("collector_character",
"collector"))), .Names = c("person", "hp_char")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
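I am doing a self-join to count the instances in which any two "hp_id" values occur together (similar to what is described in this question). I keep "person" in by=.(...) to see who is involved in each co-occurring combination (for example, hp1 and hp2 co-occur in p1, p3 and p7):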
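library(data.table)
# Self-join on person, then keep each unordered pair once and count per group
df_by2 <- setDT(df)[df, on = "person", allow.cartesian = TRUE][
  hp_char < i.hp_char, .N, by = .(person, HP_ID1 = hp_char, HP_ID2 = i.hp_char)]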
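How the grouping columns in `by` shape `.N` can be seen on a toy table. The following is a minimal sketch, assuming a hypothetical two-person table `toy` (not the question's data) in which both people share the same pair; the names `joined`, `with_person` and `without_person` are illustrative only.

```r
library(data.table)

# Hypothetical toy data: p1 and p3 both have hp1 and hp2.
toy <- data.table(
  person  = c("p1", "p1", "p3", "p3"),
  hp_char = c("hp1", "hp2", "hp1", "hp2")
)

# Self-join on person, keeping each unordered pair once per person.
joined <- toy[toy, on = "person", allow.cartesian = TRUE][hp_char < i.hp_char]

# Grouping by person as well: one row per person and pair.
with_person <- joined[, .N, by = .(person, HP_ID1 = hp_char, HP_ID2 = i.hp_char)]

# Grouping by the pair only: one row per pair, .N counts the people sharing it.
without_person <- joined[, .N, by = .(HP_ID1 = hp_char, HP_ID2 = i.hp_char)]
```

In this sketch `with_person` has two rows with `N` equal to 1, while `without_person` has a single row with `N` equal to 2.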
However, because "person" is included in by = .(person, ...), the counts (N) are split by each combination of "person", "HP_ID1" and "HP_ID2". So I switched to dplyr to get close to what I want, as follows.
library(dplyr)
library(tidyr)
dfx <- df_by2 %>%
  group_by(HP_ID1, HP_ID2) %>%
  mutate(counts = length(person)) %>%
  spread(person, person) %>%
  select(-N) %>%
  unique() %>%
  filter(counts > 1) %>%
  unite(person, p1:p7, sep = "") %>%
  mutate(involved_id = gsub('?NA', ' ', person)) %>%
  select(-person)
This is the output I get:
# A tibble: 12 x 4
   HP_ID1 HP_ID2 counts involved_id
   <chr>  <chr>   <int> <chr>
 1 hp1    hp2         3 p1 p3 p7
 2 hp1    hp3         3 p1 p3 p7
 3 hp10   hp8         2 p2 p5
 4 hp10   hp9         2 p2 p5
 5 hp2    hp3         3 p1 p3 p7
 6 hp3    hp4         2 p1 p6
 7 hp3    hp5         2 p1 p6
 8 hp4    hp5         2 p1 p6
 9 hp5    hp6         2 p1 p4
10 hp5    hp7         2 p1 p4
11 hp6    hp7         2 p1 p4
12 hp8    hp9         2 p2 p5
This is close, but the desired output (correctly formatted, even though the "involved_id" column is not tidy) is:
# A tibble: 12 x 4
   HP_ID1 HP_ID2 counts involved_id
   <chr>  <chr>   <int> <chr>
 1 hp1    hp2         3 p1, p3, p7
 2 hp1    hp3         3 p1, p3, p7
 3 hp10   hp8         2 p2, p5
 4 hp10   hp9         2 p2, p5
 5 hp2    hp3         3 p1, p3, p7
 6 hp3    hp4         2 p1, p6
 7 hp3    hp5         2 p1, p6
 8 hp4    hp5         2 p1, p6
 9 hp5    hp6         2 p1, p4
10 hp5    hp7         2 p1, p4
11 hp6    hp7         2 p1, p4
12 hp8    hp9         2 p2, p5
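All of this is cumbersome, and I wonder whether there is a simpler way. I have only recently started using data.table and enjoy learning it. Any help using data.table is much appreciated.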
Answer 0 (score: 2)
Continuing from the previously posted answer here (also copied below for convenience), use .(.N, involved_id=paste(x.person, collapse=", ")) to get the final desired output:
library(data.table)
setDT(df)
nset <- 3
cols <- paste0("hp_char", seq_len(nset))
#create combinations of nset number of skills
combi <- do.call(CJ, rep(df[,.(unique(hp_char))], nset))
setnames(combi, cols)
#create for each person the combinations of nset number of skills
nsetSkills <- df[, do.call(CJ, rep(.(hp_char), nset)), by=.(person)]
setnames(nsetSkills, names(nsetSkills)[-1L], cols)
ans <- nsetSkills[combi, on=cols,
.(.N, involved_id=paste(x.person, collapse=", ")), by=.EACHI]
ans
Output:
      hp_char1 hp_char2 hp_char3 N involved_id
   1:      hp1      hp1      hp1 3  p1, p3, p7
   2:      hp1      hp1     hp10 0          NA
   3:      hp1      hp1      hp2 3  p1, p3, p7
   4:      hp1      hp1      hp3 3  p1, p3, p7
   5:      hp1      hp1      hp4 1          p1
  ---
 996:      hp9      hp9      hp5 0          NA
 997:      hp9      hp9      hp6 0          NA
 998:      hp9      hp9      hp7 0          NA
 999:      hp9      hp9      hp8 2      p2, p5
1000:      hp9      hp9      hp9 2      p2, p5
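The output above is for nset <- 3, so it lists triples (including repeats such as hp1/hp1/hp1) and combinations that nobody has. For the pairs asked about in the question, a more direct route may be to aggregate the self-join by pair only and paste the persons together in the same step. The following is a minimal sketch, not part of the original answer: `dt` is the question's data rebuilt compactly, and `pairs` is an illustrative name.

```r
library(data.table)

# The question's data, rebuilt compactly (25 rows, 7 persons).
dt <- data.table(
  person  = rep(paste0("p", 1:7), times = c(7, 3, 3, 3, 3, 3, 3)),
  hp_char = paste0("hp", c(1:7, 8:10, 1:3, 5:7, 8:10, 3:5, 1:3))
)

# Self-join on person, keep each unordered pair once (hp_char < i.hp_char),
# then count and collect the persons per pair instead of per person and pair.
pairs <- dt[dt, on = "person", allow.cartesian = TRUE][
  hp_char < i.hp_char,
  .(counts = .N, involved_id = paste(person, collapse = ", ")),
  by = .(HP_ID1 = hp_char, HP_ID2 = i.hp_char)
][counts > 1][order(HP_ID1, HP_ID2)]

pairs
```

Because `person` is no longer a grouping column, it stays available inside `j` as a vector, so it can be counted with `.N` and collapsed with `paste()` in one pass. No switch to dplyr is needed.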
Answer 1 (score: 1)
Perhaps you are interested in an "all-tidyverse" approach (using combn plus summarise to do the self-join)?
library(tidyverse)
df %>%
    group_by(person) %>%
    # one tibble of all hp_char pairs per person
    summarise(tmp = list(setNames(
        as_tibble(t(combn(hp_char, 2))),
        c("HP_ID1", "HP_ID2")))) %>%
    unnest(tmp) %>%
    group_by(HP_ID1, HP_ID2) %>%
    summarise(
        counts = n(),
        involved_id = toString(person)) %>%
    filter(counts > 1)
## A tibble: 12 x 4
## Groups:   HP_ID1 [8]
#   HP_ID1 HP_ID2 counts involved_id
#   <chr>  <chr>   <int> <chr>
# 1 hp1    hp2         3 p1, p3, p7
# 2 hp1    hp3         3 p1, p3, p7
# 3 hp2    hp3         3 p1, p3, p7
# 4 hp3    hp4         2 p1, p6
# 5 hp3    hp5         2 p1, p6
# 6 hp4    hp5         2 p1, p6
# 7 hp5    hp6         2 p1, p4
# 8 hp5    hp7         2 p1, p4
# 9 hp6    hp7         2 p1, p4
#10 hp8    hp10        2 p2, p5
#11 hp8    hp9         2 p2, p5
#12 hp9    hp10        2 p2, p5
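The combn idea also carries over to data.table. The following is a minimal sketch, not part of the original answer, assuming the question's data rebuilt as `dt`; `per_person` and `res` are illustrative names. Sorting `hp_char` before calling `combn` is an added assumption, so that each pair is always written in the same order regardless of row order.

```r
library(data.table)

# The question's data, rebuilt compactly (25 rows, 7 persons).
dt <- data.table(
  person  = rep(paste0("p", 1:7), times = c(7, 3, 3, 3, 3, 3, 3)),
  hp_char = paste0("hp", c(1:7, 8:10, 1:3, 5:7, 8:10, 3:5, 1:3))
)

# For each person, enumerate all unordered pairs of their hp_char values.
per_person <- dt[, {
  m <- combn(sort(hp_char), 2)  # 2 x choose(n, 2) character matrix
  .(HP_ID1 = m[1, ], HP_ID2 = m[2, ])
}, by = person]

# Count the persons per pair, collect their ids, keep pairs seen more than once.
res <- per_person[, .(counts = .N, involved_id = toString(person)),
                  by = .(HP_ID1, HP_ID2)][counts > 1]

res
```

Unlike the self-join, this never materialises the ordered and self-matched pairs that are later filtered out. Note that `combn` needs at least two values per person, which holds for this data.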