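I have the following data.table, in which the same customer can appear more than once within a single observation (row).
Consider the minimal example:
DT = data.table(Firm = c("Firm1", "Firm2", "Firm3", "Firm4"), Customer1=c("Alice", "Bob", "Alice", "Bob"), ID1=c("1", "2", "1", "2"), Customer2=c("Charly", "Sarah", "Alicia", "Jack"), ID2=c("3", "4", "1", "5"), Customer3=c("Kevin", "Sara", "Deborah", "NA"), ID3=c("6", "4", "7", "NA"))
Firm Customer1 ID1 Customer2 ID2 Customer3 ID3
Firm1 Alice 1 Charly 3 Kevin 6
Firm2 Bob 2 Sarah 4 Sara 4
Firm3 Alice 1 Alicia 1 Deborah 7
Firm4 Bob 2 Jack 5 NA NA
The problem is that the same customer may be counted twice because the name is misspelled (see Alice / Alicia and Sara / Sarah in the minimal example), although the ID tells me it is the same customer. So I would like to clean the data by deleting Customer + ID whenever that ID has already appeared in the same observation. Ideally, the final DT would look like this, with the repeated pair set to NA: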
Firm Customer1 ID1 Customer2 ID2 Customer3 ID3
Firm1 Alice 1 Charly 3 Kevin 6
Firm2 Bob 2 Sarah 4 NA NA
Firm3 Alice 1 NA NA Deborah 7
Firm4 Bob 2 Jack 5 NA NA
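The data set is very large, so if possible I would like to avoid looping over every row and comparing the various combinations of Customer and ID. Does anyone know an efficient solution that I am missing?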
Answer 0 (score: 3)
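In the situation you describe, I would first melt the data into long format, then remove the duplicates by Firm and ID with the unique function, add a new rowid for each Firm, and finally reshape it back into wide format with dcast.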
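Using:
library(data.table)
# long format: each CustomerN column is paired with its IDN column
DT.l <- melt(DT, id = 1, measure.vars = list(c(2,4,6), c(3,5,7)),
             value.name = c('Customer','ID'))
# drop repeated IDs within a firm, renumber the remaining slots per firm, reshape to wide
DT.w <- dcast(unique(DT.l, by = c('Firm','ID'))[, variable := rowid(Firm)],
              Firm ~ variable, value.var = c('Customer','ID'))
# interleave the Customer_n / ID_n columns
setcolorder(DT.w, c(1:2,5,3,6,4,7))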
gives:
> DT.w
    Firm Customer_1 ID_1 Customer_2 ID_2 Customer_3 ID_3
1: Firm1      Alice    1     Charly    3      Kevin    6
2: Firm2        Bob    2      Sarah    4         NA   NA
3: Firm3      Alice    1    Deborah    7         NA   NA
4: Firm4        Bob    2       Jack    5         NA   NA
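The result above shifts the remaining customers to the left. If the layout from the question is preferred, where the repeated pair is replaced by NA in its original position, one possible sketch is to blank the duplicates in long format and keep the original slot number when reshaping. The names `dup` and `DT.na` are illustrative and not part of the answer; the approach assumes the columns follow the Customer<n> / ID<n> naming of the example.

```r
library(data.table)

# Data from the question (the "NA" entries are strings, as in the original)
DT <- data.table(Firm = c("Firm1", "Firm2", "Firm3", "Firm4"),
                 Customer1 = c("Alice", "Bob", "Alice", "Bob"),
                 ID1 = c("1", "2", "1", "2"),
                 Customer2 = c("Charly", "Sarah", "Alicia", "Jack"),
                 ID2 = c("3", "4", "1", "5"),
                 Customer3 = c("Kevin", "Sara", "Deborah", "NA"),
                 ID3 = c("6", "4", "7", "NA"))

# Long format: one row per Firm and customer slot
DT.l <- melt(DT, id.vars = "Firm",
             measure.vars = patterns("Customer", "ID"),
             value.name = c("Customer", "ID"))

# Flag every later occurrence of an ID within the same firm
dup <- duplicated(DT.l, by = c("Firm", "ID"))

# Blank the flagged pairs instead of dropping the rows
DT.l[dup, c("Customer", "ID") := list(NA_character_, NA_character_)]

# Reshape back to wide, keeping the original slot number in `variable`
DT.na <- dcast(DT.l, Firm ~ variable, value.var = c("Customer", "ID"))
```

Because no rows are dropped, every firm keeps its original slots and only the repeated entries become NA.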
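Notes:
Instead of specifying the column positions in measure.vars, you can also use patterns: measure.vars = patterns('Customer',"ID").
setcolorder is used to put the columns in the order shown in the desired output, but this is of course not strictly necessary.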
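As a sketch of how these two notes could be combined, the version below selects the measure columns with patterns() and builds the column order from names rather than positions. It assumes the columns are named Customer<n> and ID<n> as in the example; the helper variable `n` is only illustrative.

```r
library(data.table)

# Data from the question (the "NA" entries are strings, as in the original)
DT <- data.table(Firm = c("Firm1", "Firm2", "Firm3", "Firm4"),
                 Customer1 = c("Alice", "Bob", "Alice", "Bob"),
                 ID1 = c("1", "2", "1", "2"),
                 Customer2 = c("Charly", "Sarah", "Alicia", "Jack"),
                 ID2 = c("3", "4", "1", "5"),
                 Customer3 = c("Kevin", "Sara", "Deborah", "NA"),
                 ID3 = c("6", "4", "7", "NA"))

# Select the measure columns by name pattern instead of by position
DT.l <- melt(DT, id.vars = "Firm",
             measure.vars = patterns("Customer", "ID"),
             value.name = c("Customer", "ID"))

# Same de-duplication and reshaping steps as in the answer
DT.w <- dcast(unique(DT.l, by = c("Firm", "ID"))[, variable := rowid(Firm)],
              Firm ~ variable, value.var = c("Customer", "ID"))

# Optional: interleave Customer_n / ID_n by name, so the order
# does not depend on how many customer slots there are
n <- (ncol(DT.w) - 1L) / 2L
setcolorder(DT.w, c("Firm", as.vector(rbind(paste0("Customer_", 1:n),
                                            paste0("ID_", 1:n)))))
```

Building the order from names avoids having to rewrite the position vector when the number of Customer/ID pairs changes.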