I have a data frame like this:
ii <- data.frame(cid = c(rep('a',8),rep('b',5)),
Interaction = c(rep('VCS',3), c('SLS'), rep('TCU',2), rep('MFM',2), rep('SLS', 2), 'COMM', rep('MFM',2)),
stringsAsFactors = FALSE
)
cid Interaction
1 a VCS
2 a VCS
3 a VCS
4 a SLS
5 a TCU
6 a TCU
7 a MFM
8 a MFM
9 b SLS
10 b SLS
11 b COMM
12 b MFM
13 b MFM
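I want to group by cid first, and then create another column that numbers each repeated occurrence of a value in the Interaction column within its group. The result should look like this: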
cid Interaction replicate
1 a VCS 1
2 a VCS 2
3 a VCS 3
4 a SLS 1
5 a TCU 1
6 a TCU 2
7 a MFM 1
8 a MFM 2
9 b SLS 1
10 b SLS 2
11 b COMM 1
12 b MFM 1
13 b MFM 2
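Ultimately I also want to reshape this into wide format (I cannot do that with the current format, because I would lose the duplicates), something like: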
cid InteractionTuple
1 a VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2
2 b SLS1;SLS2;COMM;MFM1;MFM2
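The goal is to be able to run association rule mining techniques, which currently do not support duplicate items within a transaction.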
Answer 0: (score: 2)
Using dplyr:
library(dplyr)
ii %>%
group_by(cid, Interaction) %>%
mutate(Interaction_rn = paste0(Interaction, row_number())) %>%
group_by(cid) %>%
summarise(InteractionTuple = paste(Interaction_rn, collapse = ";"))
# # A tibble: 2 x 2
# cid InteractionTuple
# <chr> <chr>
# 1 a VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2
# 2 b SLS1;SLS2;COMM1;MFM1;MFM2
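For readers who prefer to avoid extra packages, the same two steps can be sketched in base R. This is only an illustrative alternative, not part of the answer above: the helper columns `item` and the object names `tuples` and `wide` are made up for this sketch, and it assumes the rows are already in the order in which the tuple should be built.

```r
# Rebuild the question's data frame so the sketch runs on its own.
ii <- data.frame(
  cid = c(rep("a", 8), rep("b", 5)),
  Interaction = c(rep("VCS", 3), "SLS", rep("TCU", 2), rep("MFM", 2),
                  rep("SLS", 2), "COMM", rep("MFM", 2)),
  stringsAsFactors = FALSE
)

# ave() applies seq_along() within each (cid, Interaction) group and
# puts the counters back in the original row order.
ii$replicate <- ave(seq_along(ii$cid), ii$cid, ii$Interaction, FUN = seq_along)

# Glue label and counter together (hypothetical helper column "item"),
# then collapse the items of each cid into one string.
ii$item <- paste0(ii$Interaction, ii$replicate)
tuples <- tapply(ii$item, ii$cid, paste, collapse = ";")

wide <- data.frame(
  cid = names(tuples),
  InteractionTuple = as.vector(tuples),
  stringsAsFactors = FALSE
)
wide
```

As in the dplyr version, a value that occurs only once still gets the suffix 1 (COMM1 rather than COMM).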
Answer 1: (score: 2)
Here is a data.table solution:
library(data.table)
dt <- as.data.table(ii)  # dt is a data.table copy of the question's ii
dt[ , "replicate" := 1:.N, by = .(Interaction, cid)]
cid Interaction replicate
1: a VCS 1
2: a VCS 2
3: a VCS 3
4: a SLS 1
5: a TCU 1
6: a TCU 2
7: a MFM 1
8: a MFM 2
9: b SLS 1
10: b SLS 2
11: b COMM 1
12: b MFM 1
13: b MFM 2
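As a side note, data.table also ships a helper, `rowid()`, that numbers rows within the groups defined by its arguments, so the counter can be written without `by`. The sketch below is an alternative spelling rather than the answerer's code; it assumes a reasonably recent data.table version that exports `rowid()`, and it rebuilds the data so it runs standalone.

```r
library(data.table)

# Rebuild the question's data as a data.table.
dt <- data.table(
  cid = c(rep("a", 8), rep("b", 5)),
  Interaction = c(rep("VCS", 3), "SLS", rep("TCU", 2), rep("MFM", 2),
                  rep("SLS", 2), "COMM", rep("MFM", 2))
)

# rowid() returns 1, 2, 3, ... within each (cid, Interaction) combination.
dt[, replicate := rowid(cid, Interaction)]

# Trailing [] prints the table, since := itself returns invisibly.
dt[]
```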
Edit: part two:
dt2 = dt[ , .("InteractionTuple" = paste(Interaction, replicate, sep = "", collapse = ";")), by = .(cid)]
> dt2
cid InteractionTuple
1: a VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2
2: b SLS1;SLS2;COMM1;MFM1;MFM2
EDIT2
@MikeH suggested a different approach that might be faster. The results are below:
library(microbenchmark)
microbenchmark(dt2 = dt[ , .("replicate" = 1:.N), by = .(Interaction, cid)],
dt3 = dt[ , .("replicate" = seq_len(.N)), by = .(Interaction, cid)], times = 1000L)
Unit: microseconds
expr min lq mean median uq max neval
dt2 323.960 364.361 434.6370 402.8740 457.6220 2382.88 1000
dt3 318.296 360.585 508.1313 397.3985 461.5865 42750.25 1000
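With seq_len(.N) the median is slightly better (about 397.4 versus 402.9 microseconds, a difference of roughly 1%), although its mean is higher because of a much larger maximum. On data this small the two are effectively equivalent.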
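Speed aside, there is a correctness reason why `seq_len()` is often preferred over `1:n` in general R code: the two differ when `n` is zero. Inside the `by` groups used above every group has at least one row, so both give the same counters there; the plain base R sketch below only illustrates the edge case.

```r
n <- 0L

# The colon operator counts down when the right-hand side is below 1,
# so this yields two elements instead of none.
colon_result <- 1:n

# seq_len() returns an empty integer vector for n = 0.
seq_result <- seq_len(n)

length(colon_result)  # 2
length(seq_result)    # 0
```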
Answer 2: (score: 1)
This answer is based on dplyr.
Part one
Q1=ii%>%group_by(cid,Interaction)%>%
mutate(replicate=rank(Interaction,ties.method="first"))
Q1
cid Interaction replicate
<chr> <chr> <int>
1 a VCS 1
2 a VCS 2
3 a VCS 3
4 a SLS 1
5 a TCU 1
6 a TCU 2
7 a MFM 1
8 a MFM 2
9 b SLS 1
10 b SLS 2
11 b COMM 1
12 b MFM 1
13 b MFM 2
Part two
Q2=Q1%>%group_by(cid)%>%
summarise(InteractionTuple=paste0(Interaction,replicate,collapse = ";"))
Q2
# A tibble: 2 × 2
cid InteractionTuple
<chr> <chr>
1 a VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2
2 b SLS1;SLS2;COMM1;MFM1;MFM2
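Since the stated goal is association rule mining, it can be worth checking that the suffixed items really are unique within each transaction. The following base R sketch splits the tuple strings back into items and tests for duplicates. The names `tuples`, `items` and `has_dupes` are made up for this illustration, and the assumption that a downstream tool wants one character vector of items per transaction is mine, not something stated in the thread.

```r
# The two tuple strings produced by the answers above, keyed by cid.
tuples <- c(
  a = "VCS1;VCS2;VCS3;SLS1;TCU1;TCU2;MFM1;MFM2",
  b = "SLS1;SLS2;COMM1;MFM1;MFM2"
)

# Split on the literal separator; the result is a named list with one
# character vector of items per cid.
items <- strsplit(tuples, ";", fixed = TRUE)

# anyDuplicated() returns 0 when all items in a vector are distinct.
has_dupes <- vapply(items, function(x) anyDuplicated(x) > 0, logical(1))
has_dupes
```

If any entry of `has_dupes` were TRUE, the numbering step would not have made the items unique for that cid.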