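This question is an extension of "how to get quick summary of count in data.table".
Again, this is part of feature engineering: each ID is summarized by a column named Col while looking back over a time window. The same preprocessing will be applied to the test set. Because the dataset is large, a data.table-based solution is preferred.
Training input: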
ID Time Col Count
A 2017-06-05 M 1
A 2017-06-02 M 1
A 2017-06-03 M 1
B 2017-06-02 K 1
B 2017-06-01 M 4
Applying a two-day lookback, we have:
ID Time Time-2D Col Count
A 2017-06-05 2017-06-03 M 1 #Time-2D by moving time two days back
A 2017-06-02 2017-05-31 M 1
A 2017-06-03 2017-06-01 M 1
B 2017-06-02 2017-05-31 K 1
B 2017-06-01 2017-05-30 M 4
Expected output (counts):
ID Time Time-2D Col_M Col_K
A 2017-06-05 2017-06-03 1 0 #from 2017-06-03 to 2017-06-05, for user A, there are 0 (sum(count)) of K and 1 (sum(count)) of M.
A 2017-06-02 2017-05-31 1 0
A 2017-06-03 2017-06-01 2 0 # 2 because from 06-01 to 06-03 there are two rows in the first table (A 2017-06-02 M 1; A 2017-06-03 M 1), so the summed count for M is 2.
B 2017-06-02 2017-05-31 0 1
B 2017-06-01 2017-05-30 4 0
Based on the table above, the expected output (ratios):
ID Time Time-2D Col_M Col_K
A 2017-06-05 2017-06-03 1 0 # 1/sum(1+0)
A 2017-06-02 2017-05-31 1 0
A 2017-06-03 2017-06-01 1 0 #2/sum(2+0)
B 2017-06-02 2017-05-31 0 1
B 2017-06-01 2017-05-30 1 0 # 4/sum(4+0)
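The above covers processing the training data. For the test dataset, the output must map onto Col_M and Col_K only, meaning that if other values such as S appear in Col, they are ignored.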
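As a minimal sketch of the test-set requirement just described, the snippet below drops values of Col that were not seen in training before casting to wide format. The test rows and the names test, keep_levels, test_kept, and test_wide are hypothetical, and the lookback window itself is not applied here; only the "ignore S" step is shown.

```r
library(data.table)

# Hypothetical test rows; the value "S" never appeared in the training data
test <- data.table(
  ID    = c("A", "A", "B"),
  Time  = as.IDate(c("2017-06-07", "2017-06-08", "2017-06-07")),
  Col   = c("M", "S", "K"),
  Count = c(2L, 5L, 3L)
)

# Levels kept from training (assumption: in practice, derive these from the training set)
keep_levels <- c("M", "K")

# Drop unseen levels, then fix the factor levels so that every training
# level can still become a column even if it is absent from the test rows
test_kept <- test[Col %in% keep_levels]
test_kept[, Col := factor(Col, levels = keep_levels)]

# drop = c(TRUE, FALSE) keeps all factor levels on the right-hand side of the formula
test_wide <- dcast(test_kept, ID + Time ~ Col, value.var = "Count",
                   fun.aggregate = sum, fill = 0L, drop = c(TRUE, FALSE))
setnames(test_wide, keep_levels, paste0("Col_", keep_levels))
print(test_wide)
```

Note that a row whose only value is an unseen level (A on 2017-06-08 here) disappears from the wide table; if such rows must be kept with all-zero counts, they would need to be joined back in afterwards.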
Answer 0 (score: 1)
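I think I understand what you are asking. You seem to care about the order of the observations, i.e., an earlier row is counted only by its order of appearance, regardless of whether the second observation's Time is before the first observation's Time. This doesn't make much sense to me, but here is a quite efficient data.table solution that achieves it. It is basically a non-equi join on ID, Col, the Time columns, and the row index (which is basically the order of appearance). After that, it is just a dcast to convert from long to wide (as in your previous question). Note that the result is ordered by date, but I kept the rowindx variable so you can reorder it using setorder. Also, I'll leave the ratio calculation to you, as it is pretty basic (hint: don't use loops, it's a fully vectorized one-liner).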
library(data.table) #v1.10.4+
## Read the data
DT <- fread("ID Time Col Count
A 2017-06-05 M 1
A 2017-06-02 M 1
A 2017-06-03 M 1
B 2017-06-02 K 1
B 2017-06-01 M 4")
## Prepare the variables we need for the join
DT[, Time := as.IDate(Time)]
DT[, Time_2D := Time - 2L]
DT[, rowindx := .I]
## Non-equi join, sum `Count` by each join
DT2 <- DT[DT,
          sum(Count),
          on = .(ID, Col, rowindx <= rowindx, Time <= Time, Time >= Time_2D),
          by = .EACHI]
## Fix column names (a known issue)
setnames(DT2, make.unique(names(DT2)))
## Long to wide (You can reorder back using `rowindx` and `setorder` function)
dcast(DT2, ID + Time + Time.1 + rowindx ~ Col, value.var = "V1", fill = 0)
# ID Time Time.1 rowindx K M
# 1: A 2017-06-02 2017-05-31 2 0 1
# 2: A 2017-06-03 2017-06-01 3 0 2
# 3: A 2017-06-05 2017-06-03 1 0 1
# 4: B 2017-06-01 2017-05-30 5 0 4
# 5: B 2017-06-02 2017-05-31 4 1 0
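The ratio step left as an exercise can be sketched as follows. This is one possible vectorized form, not necessarily the one the answer had in mind. The counts are re-typed from the printed output so the snippet runs on its own; the names wide, cols, and total are illustrative.

```r
library(data.table)

# Wide counts re-typed from the printed result (K and M columns)
wide <- data.table(
  ID   = c("A", "A", "A", "B", "B"),
  Time = as.IDate(c("2017-06-02", "2017-06-03", "2017-06-05",
                    "2017-06-01", "2017-06-02")),
  K    = c(0L, 0L, 0L, 0L, 1L),
  M    = c(1L, 2L, 1L, 4L, 0L)
)

cols <- c("K", "M")

# Row totals across the count columns
total <- wide[, rowSums(.SD), .SDcols = cols]

# Divide every count column by the row total in one vectorized step
wide[, paste0("Col_", cols) := lapply(.SD, `/`, total), .SDcols = cols]
print(wide)
```

A row whose counts sum to zero would yield NaN here, so such rows may need separate handling depending on how the features are used.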
Answer 1 (score: 0)
You could try:
dt <- fread("ID Time Time-2D Col Count
A 2017-06-05 2017-06-03 M 1
A 2017-06-02 2017-05-31 M 1
A 2017-06-03 2017-06-01 M 1
B 2017-06-02 2017-05-31 K 1
B 2017-06-01 2017-05-30 M 4")
dt1 <- dcast(dt, ID+Time+`Time-2D`~Col, value.var = c("Count"))
dt1[, K := ifelse(is.na(K), 0, K)]
dt1[, M := ifelse(is.na(M), 0, M)]
ID Time Time-2D K M
1: A 2017-06-02 2017-05-31 0 1
2: A 2017-06-03 2017-06-01 0 1
3: A 2017-06-05 2017-06-03 0 1
4: B 2017-06-01 2017-05-30 0 4
5: B 2017-06-02 2017-05-31 1 0
dt1[, Col_K := K/(K+M)]
dt1[, Col_M := M/(K+M)]
ID Time Time-2D K M Col_K Col_M
1: A 2017-06-02 2017-05-31 0 1 0 1
2: A 2017-06-03 2017-06-01 0 1 0 1
3: A 2017-06-05 2017-06-03 0 1 0 1
4: B 2017-06-01 2017-05-30 0 4 0 1
5: B 2017-06-02 2017-05-31 1 0 1 0
Maybe you can combine the last two lines, with something like dt1[, `:=`()].
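A minimal sketch of that combined form, using the functional `:=`() syntax to add both ratio columns in a single call. The K and M values are re-typed from the table above so the snippet is self-contained.

```r
library(data.table)

# K and M counts re-typed from the table printed above
dt1 <- data.table(K = c(0, 0, 0, 0, 1), M = c(1, 1, 1, 4, 0))

# Add both ratio columns in a single call
dt1[, `:=`(Col_K = K / (K + M),
           Col_M = M / (K + M))]
print(dt1)
```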