My data looks like this:
data <- matrix(c("1","install","2015-10-23 14:07:20.000000",
"2","install","2015-10-23 14:08:20.000000",
"3","install","2015-10-23 14:07:25.000000",
"3","sale","2015-10-23 14:08:20.000000",
"4","install","2015-10-23 14:07:20.000000",
"4","sale","2015-10-23 14:09:20.000000",
"4","sale","2015-10-23 14:11:20.000000"),
ncol=3, byrow=TRUE)
colnames(data) <- c("id","event","time")
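I want to add a fourth column called label, in which every row is labelled according to certain values. In this case (going by the desired output):
label 0: the id has only an install
label 1: the id has an install and one sale
label 2: the id has an install and two sales
and so on up to n sales.
It should end up looking like: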
data1 <- matrix(c("1","install","2015-10-23 14:07:20.000000","0",
"2","install","2015-10-23 14:08:20.000000","0",
"3","install","2015-10-23 14:07:25.000000","1",
"3","sale","2015-10-23 14:08:20.000000","1",
"4","install","2015-10-23 14:07:20.000000","2",
"4","sale","2015-10-23 14:09:20.000000","2",
"4","sale","2015-10-23 14:11:20.000000","2"),
ncol=4, byrow=TRUE)
I'm not sure what the best way is to create a "label" based on conditions in R... perhaps dplyr::mutate?
Answer 0: (score: 4)
Using base R:
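We can use ave with sum to count the occurrences of "sale" by id. Then we check which ids are unique, i.e. occur on only one row (uniq), and assign "0" to any such row. cbind combines everything. I also convert to a data.frame, since there is no reason to store mixed information in a matrix.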
# Count the "sale" rows within each id; ave() repeats the count on every row of that id
indx <- ave(data[,2], data[,1], FUN=function(x) sum(x == "sale"))
# Flag the ids that occur on exactly one row
uniq <- table(data[,1]) == 1
# Match on the id values (the table names) rather than on their positions
indx[data[,1] %in% names(uniq)[uniq]] <- "0"
cbind.data.frame(data, label = indx)
#   id   event                       time label
# 1  1 install 2015-10-23 14:07:20.000000     0
# 2  2 install 2015-10-23 14:08:20.000000     0
# 3  3 install 2015-10-23 14:07:25.000000     1
# 4  3    sale 2015-10-23 14:08:20.000000     1
# 5  4 install 2015-10-23 14:07:20.000000     2
# 6  4    sale 2015-10-23 14:09:20.000000     2
# 7  4    sale 2015-10-23 14:11:20.000000     2
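To see what each step of the base R approach contributes, the intermediate values can be inspected on their own. This is only a minimal sketch on a reduced copy of the question's matrix; the names m, sales_per_id, rows_per_id and single_row_ids are made up for illustration and do not appear in the answer.

```r
# Reduced copy of the question's matrix (id and event columns only)
m <- matrix(c("1", "install",
              "3", "install",
              "3", "sale",
              "4", "install",
              "4", "sale",
              "4", "sale"),
            ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("id", "event")))

# ave() applies FUN within each id and writes the result back to every row
# of that id. The input vector is character, so the counts are stored as
# character strings as well.
sales_per_id <- ave(m[, "event"], m[, "id"],
                    FUN = function(x) sum(x == "sale"))

# table() counts the rows per id; its names are the id values themselves
rows_per_id <- table(m[, "id"])
single_row_ids <- names(rows_per_id)[rows_per_id == 1]

print(sales_per_id)    # "0" "1" "1" "2" "2" "2"
print(single_row_ids)  # "1"
```

Because the counts come back as strings, the answer assigns "0" rather than 0 when overwriting the single-row ids.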
Answer 1: (score: 4)
Updated to reflect the "and so on up to n sales" requirement.
A dplyr option could be:
library(dplyr)
data <- as.data.frame(data)
data %>%
group_by(id) %>%
mutate(label = if(n() == 1) 0 else as.numeric(sum(event == "sale")))
#Source: local data frame [7 x 4]
#Groups: id [4]
#
# id event time label
# (fctr) (fctr) (fctr) (dbl)
#1 1 install 2015-10-23 14:07:20.000000 0
#2 2 install 2015-10-23 14:08:20.000000 0
#3 3 install 2015-10-23 14:07:25.000000 1
#4 3 sale 2015-10-23 14:08:20.000000 1
#5 4 install 2015-10-23 14:07:20.000000 2
#6 4 sale 2015-10-23 14:09:20.000000 2
#7 4 sale 2015-10-23 14:11:20.000000 2
The data.table equivalent is:
library(data.table)
data <- as.data.table(data) # or setDT(data) if it's already a data.frame
data[, label := if(.N == 1) 0 else as.numeric(sum(event == "sale")), by=id]
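The single-row branch (n() == 1 in dplyr, .N == 1 in data.table) only changes the result for an id whose one and only row is a sale; for an id with just an install the plain sale count is already 0. The sketch below illustrates this with made-up rows. The data frame toy, the ids "8" and "9" and the extra column sale_count are hypothetical and not part of the question's data.

```r
library(dplyr)

# Hypothetical data: id "9" has a single row, and that row is a sale
toy <- data.frame(id    = c("8", "8", "9"),
                  event = c("install", "sale", "sale"),
                  stringsAsFactors = FALSE)

labelled <- toy %>%
  group_by(id) %>%
  mutate(
    # rule from the answer: single-row ids get 0, otherwise the sale count
    label      = if (n() == 1) 0 else as.numeric(sum(event == "sale")),
    # plain sale count, shown only for comparison
    sale_count = as.numeric(sum(event == "sale"))
  ) %>%
  ungroup()

print(labelled)
```

Here id "9" gets label 0 but sale_count 1, so which rule is right depends on whether a sale without an install should count.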
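Answer 2: (score: 0)
Another dplyr approach to adding a column of summarised values is to create the summary variables in a separate table and then join that table back to the main data.frame, like so: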
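library(dplyr)
# group_by() and left_join() need a data.frame rather than a matrix
data <- as.data.frame(data)
# Summarise per id, join the summary back, then derive the label
data <- left_join(data,
                  data %>%
                    group_by(id) %>%
                    summarise(count = n(), sales = sum(event == "sale")),
                  by = "id"
) %>%
  mutate(label = ifelse(count == 1, 0, sales)) %>%
  select(-count, -sales)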
> data
id event time label
1 1 install 2015-10-23 14:07:20.000000 0
2 2 install 2015-10-23 14:08:20.000000 0
3 3 install 2015-10-23 14:07:25.000000 1
4 3 sale 2015-10-23 14:08:20.000000 1
5 4 install 2015-10-23 14:07:20.000000 2
6 4 sale 2015-10-23 14:09:20.000000 2
7 4 sale 2015-10-23 14:11:20.000000 2
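The same summarise-then-join idea can also be sketched without dplyr, using aggregate() and merge() from base R. This is an illustrative variant rather than the answerer's code; the names df, summary_tbl and out are made up, and the data is a reduced copy of the question's rows.

```r
# Reduced copy of the question's data as a data.frame
df <- data.frame(id    = c("1", "3", "3", "4", "4", "4"),
                 event = c("install", "install", "sale",
                           "install", "sale", "sale"),
                 stringsAsFactors = FALSE)

# One row per id: 0 for single-row ids, otherwise the number of sales
summary_tbl <- aggregate(df["event"], by = df["id"],
                         FUN = function(x) {
                           if (length(x) == 1) 0 else as.numeric(sum(x == "sale"))
                         })
names(summary_tbl)[names(summary_tbl) == "event"] <- "label"

# Join the per-id label back onto every original row
out <- merge(df, summary_tbl, by = "id")
print(out)
```

Note that merge() may reorder rows, so compare results by id rather than by row position.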