Question

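I have a large dataset of about 320,000 rows. It holds information on telephone numbers and on the origin and destination of calls.

For each telephone number, I want to count the number of times it appears as either Origin or Destination.

An example data table is as follows: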

library(data.table)
dt <- data.table(Tel=seq(1,5,1), Origin=seq(1,5,1), Destination=seq(3,7,1))

    Tel Origin Destination 
1:   1      1           3 
2:   2      2           4
3:   3      3           5 
4:   4      4           6 
5:   5      5           7

I have working code, but it takes too long on my data because it involves a for loop. How can I optimize it?

Here it is:

for (i in unique(dt$Tel)){
    index <- (dt$Origin == i | dt$Destination == i)
    dt[dt$Tel ==i, "N"] <- sum(index)
}

Result:

    Tel Origin Destination N
1:   1      1           3  1
2:   2      2           4  1
3:   3      3           5  2
4:   4      4           6  2
5:   5      5           7  2

where N shows that Tel = 1 appears once, Tel = 2 appears once, and Tel = 3, 4 and 5 each appear twice.

Answer 1

We can use melt and match:

dt[, N := melt(dt, id.var = "Tel")[, tabulate(match(value, Tel))]]
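The one-liner above can be unpacked into its three steps. This is only a sketch, assuming Tel is unique within dt; the intermediate names long, pos and n are introduced here for illustration, and the nbins argument is an addition that keeps one slot per row even when the last Tel values never appear.

```r
library(data.table)

dt <- data.table(Tel = seq(1, 5, 1), Origin = seq(1, 5, 1), Destination = seq(3, 7, 1))

# Step 1: stack Origin and Destination into a single long "value" column.
long <- melt(dt, id.vars = "Tel")

# Step 2: for every stacked value, find the row of dt holding that Tel (NA if none).
pos <- match(long$value, dt$Tel)

# Step 3: count how often each row position occurs; NA positions are ignored.
# nbins = nrow(dt) is an assumption added here so the result always has one count per row.
n <- tabulate(pos, nbins = nrow(dt))

dt[, N := n]
print(dt)
```

Values such as 6 and 7, which occur as a Destination but not as a Tel, produce NA in step 2 and are simply not counted.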

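Another option is to loop through columns 2 and 3, use %in% to check whether each 'Tel' value is present in the column, then use Reduce with + to sum the logical results for each 'Tel', and assign (:=) the result to 'N':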

dt[, N := Reduce(`+`, lapply(.SD, function(x) Tel %in% x)), .SDcols = 2:3]
dt
#   Tel Origin Destination N
#1:   1      1           3 1
#2:   2      2           4 1
#3:   3      3           5 2
#4:   4      4           6 2
#5:   5      5           7 2
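One caveat may be worth checking on real data: %in% only tests presence, so the Reduce approach counts how many of the two columns contain each Tel, not how many times it occurs. The two agree in the example because no value repeats within a column. The sketch below uses a small hypothetical table (dup, with made-up values) to show the difference next to an occurrence-based count.

```r
library(data.table)

# Hypothetical data: Tel 1 occurs twice in Origin and once in Destination.
dup <- data.table(Tel = c(1, 2, 3), Origin = c(1, 1, 2), Destination = c(3, 3, 1))

# Presence-based count: number of columns (0, 1 or 2) that contain each Tel.
dup[, n_cols := Reduce(`+`, lapply(.SD, function(x) Tel %in% x)),
    .SDcols = c("Origin", "Destination")]

# Occurrence-based count: total appearances across both columns.
all_values <- c(dup$Origin, dup$Destination)
dup[, n_total := tabulate(match(all_values, Tel), nbins = .N)]

print(dup)
```

Here Tel 1 gets n_cols = 2 but n_total = 3, so which variant is appropriate depends on whether repeated calls should be counted.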

Answer 2

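A second method constructs a temporary data.table, which is then joined to the original. This is longer and likely less efficient than @akrun's answer, but it can be useful to see.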

# get temporary data.table as the sum of origin and destination frequencies
temp <- setnames(data.table(table(unlist(dt[, .(Origin, Destination)], use.names=FALSE))),
                 c("Tel", "N"))
# turn the variables into integers (Tel is the name of the table above, and thus character)
temp <- temp[, lapply(temp, as.integer)]

Now, join it onto the original table:

dt <- temp[dt, on="Tel"]
dt
   Tel N Origin Destination
1:   1 1      1           3
2:   2 1      2           4
3:   3 2      3           5
4:   4 2      4           6
5:   5 2      5           7

You can use setcolorder to get the desired column order:

setcolorder(dt, c("Tel", "Origin", "Destination", "N"))
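As a possible variation on this join, an update join adds N to dt by reference, which keeps the original column order and avoids the round trip through character names. This is a sketch rather than the answerer's method; counting with .N by group is swapped in for table(), and filling unmatched rows with 0 is an assumption about the desired result.

```r
library(data.table)

dt <- data.table(Tel = seq(1, 5, 1), Origin = seq(1, 5, 1), Destination = seq(3, 7, 1))

# Count every value that occurs in Origin or Destination.
temp <- data.table(Tel = c(dt$Origin, dt$Destination))[, .(N = .N), by = Tel]

# Update join: look up each Tel of dt in temp and add N in place.
dt[temp, on = "Tel", N := i.N]

# A Tel that never appears as Origin or Destination is left NA; set it to 0.
dt[is.na(N), N := 0L]

print(dt)
```

Because the column is appended by reference, no call to setcolorder is needed afterwards.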

Count the number of times a value appears in either of 2 columns
