Labeling unique values in R

Time: 2015-11-06 13:22:06

Tags: r duplicates dplyr

My data looks like this:

data <- matrix(c("1","install","2015-10-23 14:07:20.000000",
                 "2","install","2015-10-23 14:08:20.000000",
                 "3","install","2015-10-23 14:07:25.000000",
                 "3","sale","2015-10-23 14:08:20.000000",
                 "4","install","2015-10-23 14:07:20.000000",
                 "4","sale","2015-10-23 14:09:20.000000",
                 "4","sale","2015-10-23 14:11:20.000000"),
               ncol=3, byrow=TRUE)
colnames(data) <- c("id","event","time")

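I want to add a fourth column named label, in which each row is labeled according to certain values. In this case:

  • If the id is unique, the label is "0"
  • If the id is not unique and has 1 associated sale, the label is "1"
  • If the id is not unique and has 2 associated sales, the label is "2"

and so on up to n sales.

It should end up looking like this: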

data1 <- matrix(c("1","install","2015-10-23 14:07:20.000000","0",
                  "2","install","2015-10-23 14:08:20.000000","0",
                  "3","install","2015-10-23 14:07:25.000000","1",
                  "3","sale","2015-10-23 14:08:20.000000","1",
                  "4","install","2015-10-23 14:07:20.000000","2",
                  "4","sale","2015-10-23 14:09:20.000000","2",
                  "4","sale","2015-10-23 14:11:20.000000","2"),
                 ncol=4, byrow=TRUE)

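I'm not sure what the best way is to create a "label" based on conditions in R... perhaps dplyr::mutate?

3 Answers:

Answer 0: (score: 4)

Using base R

We can use ave with sum to count the occurrences of "sale" per id. Then uniq checks which ids are unique, and we assign "0" to any row whose id is unique. cbind combines it all. I also convert to a data.frame, since there is no reason to store mixed information in a matrix.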

# Number of "sale" rows per id, repeated for every row of that id
indx <- ave(data[,2], data[,1], FUN=function(x) sum(x == "sale"))
# TRUE for ids that occur exactly once
uniq <- table(data[,1]) == 1
# Match on the id names rather than on positions in the table
indx[data[,1] %in% names(which(uniq))] <- "0"
cbind.data.frame(data, count = indx)
#   id   event                       time count
# 1  1 install 2015-10-23 14:07:20.000000     0
# 2  2 install 2015-10-23 14:08:20.000000     0
# 3  3 install 2015-10-23 14:07:25.000000     1
# 4  3    sale 2015-10-23 14:08:20.000000     1
# 5  4 install 2015-10-23 14:07:20.000000     2
# 6  4    sale 2015-10-23 14:09:20.000000     2
# 7  4    sale 2015-10-23 14:11:20.000000     2
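
As a rough sketch of the same base R idea packaged as a function: the helper below (label_sales is a hypothetical name, not part of the answer) works directly on the id and event vectors and compares per-id row counts, so it does not depend on the ids happening to be 1, 2, 3, .... The ids in the example are made up for illustration.

```r
# Hypothetical helper: label each row with the number of "sale" events
# recorded for its id, or 0 when the id occurs only once.
label_sales <- function(id, event) {
  # sales per id, repeated on every row of that id
  sales <- ave(as.integer(event == "sale"), id, FUN = sum)
  # number of rows per id, repeated on every row of that id
  n_rows <- ave(seq_along(id), id, FUN = length)
  ifelse(n_rows == 1, 0, sales)
}

# Made-up ids that are not consecutive integers
id    <- c("10", "10", "25", "31", "31", "31", "40")
event <- c("install", "sale", "install", "install", "sale", "sale", "sale")
lab   <- label_sales(id, event)
print(data.frame(id, event, label = lab))
```

Note that id "40" has a single row that is itself a sale; under the rule in the question a unique id still gets 0.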

Answer 1: (score: 4)

Updated to reflect the "and so on up to n sales" requirement.

A dplyr option could be:

library(dplyr)
data <- as.data.frame(data)
data %>% 
  group_by(id) %>% 
  mutate(label = if(n() == 1) 0 else as.numeric(sum(event == "sale")))

#Source: local data frame [7 x 4]
#Groups: id [4]
#
#      id   event                       time label
#  (fctr)  (fctr)                     (fctr) (dbl)
#1      1 install 2015-10-23 14:07:20.000000     0
#2      2 install 2015-10-23 14:08:20.000000     0
#3      3 install 2015-10-23 14:07:25.000000     1
#4      3    sale 2015-10-23 14:08:20.000000     1
#5      4 install 2015-10-23 14:07:20.000000     2
#6      4    sale 2015-10-23 14:09:20.000000     2
#7      4    sale 2015-10-23 14:11:20.000000     2

The data.table equivalent is:

library(data.table)
data <- as.data.table(data)  # or setDT(data) if it's already a data.frame
data[, label := if(.N == 1) 0 else as.numeric(sum(event == "sale")), by=id]

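Answer 2: (score: 0)

Another dplyr approach to adding a column of aggregated values is to build the summary variables in a separate table and then join that table back onto the main data.frame, like so: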

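library(dplyr)
data <- as.data.frame(data)  # the question's data is a matrix
# Summarise per id, join back onto the rows, then derive the label
data <- left_join(data,
                  data %>%
                    group_by(id) %>%
                    summarise(count = n(), sales = sum(event == "sale")),
                  by = "id") %>%
  mutate(label = ifelse(count == 1, 0, sales)) %>%
  select(-count, -sales)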

> data
  id   event                       time label
1  1 install 2015-10-23 14:07:20.000000     0
2  2 install 2015-10-23 14:08:20.000000     0
3  3 install 2015-10-23 14:07:25.000000     1
4  3    sale 2015-10-23 14:08:20.000000     1
5  4 install 2015-10-23 14:07:20.000000     2
6  4    sale 2015-10-23 14:09:20.000000     2
7  4    sale 2015-10-23 14:11:20.000000     2
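
For comparison, the summarise-then-join pattern can also be sketched without dplyr. This is only an illustration under assumptions: it uses a reduced copy of the question's data (the time column is left out for brevity), and summ, joined and result are names invented here. merge() with all.x = TRUE plays the role of left_join.

```r
# Reduced copy of the question's data (time column omitted)
data <- data.frame(
  id    = c("1", "2", "3", "3", "4", "4", "4"),
  event = c("install", "install", "install", "sale", "install", "sale", "sale"),
  stringsAsFactors = FALSE
)

# Per-id summary table: number of rows and number of sales
summ <- data.frame(
  id    = names(table(data$id)),
  count = as.vector(table(data$id)),
  sales = as.vector(tapply(data$event == "sale", data$id, sum)),
  stringsAsFactors = FALSE
)

# Join the summary back onto the rows, then derive the label
joined <- merge(data, summ, by = "id", all.x = TRUE)
joined$label <- ifelse(joined$count == 1, 0, joined$sales)
result <- joined[, c("id", "event", "label")]
print(result)
```

merge() sorts by the key and does not promise to keep the original row order, so if order matters it may be safer to look the summary up with match(data$id, summ$id) instead.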