Joining in data.table and keeping the ID column

Time: 2018-08-26 22:43:31

Tags: r dataframe join data.table

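I would like to keep a column ("person" in the data frame) after performing a join in data.table. I can get something close to the desired output, but because my experience with data.table is limited, I have to switch back and forth between data.table and dplyr:

Here is the data frame: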

df<-structure(list(person = c("p1", "p1", "p1", "p1", "p1", "p1", 
"p1", "p2", "p2", "p2", "p3", "p3", "p3", "p4", "p4", "p4", "p5", 
"p5", "p5", "p6", "p6", "p6", "p7", "p7", "p7"), hp_char = c("hp1", 
"hp2", "hp3", "hp4", "hp5", "hp6", "hp7", "hp8", "hp9", "hp10", 
"hp1", "hp2", "hp3", "hp5", "hp6", "hp7", "hp8", "hp9", "hp10", 
"hp3", "hp4", "hp5", "hp1", "hp2", "hp3")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -25L), .Names = c("person", 
"hp_char"), spec = structure(list(cols = structure(list(person = structure(list(), class = c("collector_character", 
"collector")), hp_char = structure(list(), class = c("collector_character", 
"collector"))), .Names = c("person", "hp_char")), default = structure(list(), class = c("collector_guess", 
"collector"))), .Names = c("cols", "default"), class = "col_spec"))

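I am doing a self-join to count the instances in which any two hp_char values occur together (similar to what is described in this question). I keep person in by = .(...) to see who is involved in each co-occurring combination (for example, hp1 and hp2 co-occur in p1, p3 and p7):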

df_by2 <- setDT(df)[df, on = "person", allow.cartesian = TRUE][
    hp_char < i.hp_char, .N, by = .(person, HP_ID1 = hp_char, HP_ID2 = i.hp_char)]

However, because person is included in by = .(person, ...), the counts (N) are split by the combination of person, HP_ID1 and HP_ID2. So I switched to dplyr to get close to what I want, as follows.
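The effect of the grouping columns can be seen on a tiny example. This is only a sketch: the table `toy` and its two people are made up for illustration and are not the question's data.

```r
library(data.table)

# Toy data (hypothetical, not the question's df): two people sharing hp1 and hp2
toy <- data.table(person  = c("a", "a", "b", "b"),
                  hp_char = c("hp1", "hp2", "hp1", "hp2"))

# Self-join on person, keeping each unordered pair once
pairs <- toy[toy, on = "person", allow.cartesian = TRUE][hp_char < i.hp_char]

# Grouping by person as well: one row per person, each with N = 1
by_person <- pairs[, .N, by = .(person, HP_ID1 = hp_char, HP_ID2 = i.hp_char)]

# Grouping by the pair only: a single row with N = 2
by_pair <- pairs[, .N, by = .(HP_ID1 = hp_char, HP_ID2 = i.hp_char)]
```

With person among the grouping columns, every group holds exactly one row, so N cannot count co-occurrences across people.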

library(dplyr)
library(tidyr)

dfx <- df_by2 %>%
    group_by(HP_ID1, HP_ID2) %>%
    mutate(counts = length(person)) %>%
    spread(person, person) %>%
    select(-N) %>%
    unique() %>%
    filter(counts > 1) %>%
    unite(person, p1:p7, sep = "") %>%
    mutate(involved_id = gsub('?NA', ' ', person)) %>%
    select(-person)

This is the output I get:

# A tibble: 12 x 4
   HP_ID1 HP_ID2 counts   involved_id
    <chr>  <chr>  <int>      <chr>
 1    hp1    hp2      3 p1 p3   p7
 2    hp1    hp3      3 p1 p3   p7
 3   hp10    hp8      2   p2  p5  
 4   hp10    hp9      2   p2  p5  
 5    hp2    hp3      3 p1 p3   p7
 6    hp3    hp4      2  p1    p6 
 7    hp3    hp5      2  p1    p6 
 8    hp4    hp5      2  p1    p6 
 9    hp5    hp6      2  p1  p4   
10    hp5    hp7      2  p1  p4   
11    hp6    hp7      2  p1  p4   
12    hp8    hp9      2   p2  p5 

This is close, but the desired output (properly formatted, even though the involved_id column is not tidy) is:

# A tibble: 12 x 4
   HP_ID1 HP_ID2 counts   involved_id
    <chr>  <chr>  <int>      <chr>
 1    hp1    hp2      3 p1, p3, p7
 2    hp1    hp3      3 p1, p3, p7
 3   hp10    hp8      2     p2, p5
 4   hp10    hp9      2     p2, p5
 5    hp2    hp3      3 p1, p3, p7
 6    hp3    hp4      2     p1, p6
 7    hp3    hp5      2     p1, p6
 8    hp4    hp5      2     p1, p6
 9    hp5    hp6      2     p1, p4
10    hp5    hp7      2     p1, p4
11    hp6    hp7      2     p1, p4
12    hp8    hp9      2     p2, p5

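All of this is cumbersome, and I wonder whether there is a simpler way. I have only recently started using data.table and am enjoying learning it. Any help using data.table is much appreciated.

2 answers:

Answer 0: (score: 2)

Continuing from the answer posted earlier here (also copied below for convenience), use .(.N, involved_id=paste(x.person, collapse=", ")) to get the final desired output: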

library(data.table)
setDT(df)

nset <- 3
cols <- paste0("hp_char", seq_len(nset))

#create combinations of nset number of skills
combi <- do.call(CJ, rep(df[,.(unique(hp_char))], nset))
setnames(combi, cols)

#create for each person the combinations of nset number of skills
nsetSkills <- df[, do.call(CJ, rep(.(hp_char), nset)), by=.(person)]
setnames(nsetSkills, names(nsetSkills)[-1L], cols)

ans <- nsetSkills[combi, on=cols, 
    .(.N, involved_id=paste(x.person, collapse=", ")), by=.EACHI]
ans

Output:

      hp_char1 hp_char2 hp_char3 N involved_id
   1:      hp1      hp1      hp1 3  p1, p3, p7
   2:      hp1      hp1     hp10 0          NA
   3:      hp1      hp1      hp2 3  p1, p3, p7
   4:      hp1      hp1      hp3 3  p1, p3, p7
   5:      hp1      hp1      hp4 1          p1
  ---                                         
 996:      hp9      hp9      hp5 0          NA
 997:      hp9      hp9      hp6 0          NA
 998:      hp9      hp9      hp7 0          NA
 999:      hp9      hp9      hp8 2      p2, p5
1000:      hp9      hp9      hp9 2      p2, p5
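For the pairwise case asked about in the question, the same `.N` plus `paste(..., collapse = ", ")` idea could also be applied directly to the self-join, grouping by the pair only. The following is a minimal sketch, not the answerer's code; `dt` is a small hypothetical stand-in for the question's `df`.

```r
library(data.table)

# Small stand-in for the question's df (hypothetical rows, three people only)
dt <- data.table(
  person  = c("p1", "p1", "p1", "p3", "p3", "p3", "p6", "p6"),
  hp_char = c("hp1", "hp2", "hp3", "hp1", "hp2", "hp3", "hp3", "hp4")
)

res <- dt[dt, on = "person", allow.cartesian = TRUE][
  hp_char < i.hp_char,                        # keep each unordered pair once
  .(counts = .N,                              # number of people sharing the pair
    involved_id = paste(person, collapse = ", ")),
  by = .(HP_ID1 = hp_char, HP_ID2 = i.hp_char)
][counts > 1][order(HP_ID1, HP_ID2)]          # drop pairs seen only once
res
```

Because person is aggregated into involved_id rather than used as a grouping column, the count and the list of people come out of a single grouped step, with no switch to dplyr.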

Answer 1: (score: 1)

Perhaps you are interested in an "all-tidyverse" approach (using combn plus summarise rather than a self-join)?

library(dplyr)
library(tidyr)

df %>%
    group_by(person) %>%
    summarise(tmp = list(setNames(
        as_tibble(t(combn(hp_char, 2))),
        c("HP_ID1", "HP_ID2")))) %>%
    unnest(tmp) %>%
    group_by(HP_ID1, HP_ID2) %>%
    summarise(
        counts = n(),
        involved_id = toString(person)) %>%
    filter(counts > 1)
## A tibble: 12 x 4
## Groups:   HP_ID1 [8]
#   HP_ID1 HP_ID2 counts involved_id
#   <chr>  <chr>   <int> <chr>
# 1 hp1    hp2         3 p1, p3, p7
# 2 hp1    hp3         3 p1, p3, p7
# 3 hp2    hp3         3 p1, p3, p7
# 4 hp3    hp4         2 p1, p6
# 5 hp3    hp5         2 p1, p6
# 6 hp4    hp5         2 p1, p6
# 7 hp5    hp6         2 p1, p4
# 8 hp5    hp7         2 p1, p4
# 9 hp6    hp7         2 p1, p4
#10 hp8    hp10        2 p2, p5
#11 hp8    hp9         2 p2, p5
#12 hp9    hp10        2 p2, p5
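Note that this output lists hp8/hp10 where the desired output lists hp10/hp8: combn keeps the order in which each person's values appear, whereas the hp_char < i.hp_char filter in the question compares strings. If people may list the same values in different orders, sorting before combn gives every person the same key for the same pair. A small sketch with hypothetical vectors `a` and `b`:

```r
# Hypothetical input: the same two items listed in different orders
a <- c("hp2", "hp1")
b <- c("hp1", "hp2")

# combn() keeps the input order, so the two pairs differ as written
raw_a <- t(combn(a, 2))
raw_b <- t(combn(b, 2))
same_raw <- identical(raw_a, raw_b)

# Sorting first yields an identical key for the same pair
key_a <- t(combn(sort(a), 2))
key_b <- t(combn(sort(b), 2))
same_key <- identical(key_a, key_b)
```

In the pipeline above this would amount to using combn(sort(hp_char), 2); with the question's data the result is the same apart from how the hp10 pairs are ordered.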