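I'm doing some data cleaning/formatting, and I want to add a unique identifier to each record by name and then by date. For example, "Bob" might have four check-in dates, two of which are the same. In that case, I want to give him three distinct (sequential) ID numbers.
Here is the closest I've gotten to the desired result.
The sample dataset I created: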
tst <- data_frame(
name = c("Bob", "Sam", "Roger", "Stacy", "Roger", "Roger", "Sam", "Bob", "Sam", "Stacy", "Bob", "Stacy", "Roger", "Bob"),
date = as.Date(c("2009-07-03", "2010-08-12", "2009-07-03", "2016-04-01", "2002-01-03", "2019-02-10", "2005-04-17", "2009-07-03", "2010-09-21", "2012-11-12", "2015-12-31", "2014-10-10", "2015-06-02", "2003-08-21")),
amount = round(runif(14, 0, 100), 2)
)
Generating the check_in_number variable:
tst2 <- tst %>%
arrange(date) %>%
group_by(name, date) %>%
mutate(check_in_number = row_number())
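The lines above generate check_in_number for Bob in the following order: 1, 1, 2, 1. Instead, I want the output to be 1, 2, 2, 3. In other words, I want check-ins on the same date to be treated as a single check-in.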
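Is this possible with the tidyverse? Am I overlooking a simple approach?
There is a similar question here, but it doesn't cover my case, because my problem involves a date variable that has to be ordered. In other words, my data require the new variable to be sequential.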
How to number/label data-table by group-number from group_by?
Answer 0 (score: 5)
You need group_indices:
library(tidyverse)
tst <- tibble(
name = c("Bob", "Sam", "Roger", "Stacy", "Roger", "Roger", "Sam", "Bob", "Sam", "Stacy", "Bob", "Stacy", "Roger", "Bob"),
date = as.Date(c("2009-07-03", "2010-08-12", "2009-07-03", "2016-04-01", "2002-01-03", "2019-02-10", "2005-04-17", "2009-07-03", "2010-09-21", "2012-11-12", "2015-12-31", "2014-10-10", "2015-06-02", "2003-08-21")),
amount = round(runif(14, 0, 100), 2)
)
tst %>%
arrange(name, date) %>%
mutate(check_in_number = group_indices(., name, date))
#> # A tibble: 14 x 4
#> name date amount check_in_number
#> <chr> <date> <dbl> <int>
#> 1 Bob 2003-08-21 91.1 1
#> 2 Bob 2009-07-03 38.1 2
#> 3 Bob 2009-07-03 28.3 2
#> 4 Bob 2015-12-31 22.3 3
#> 5 Roger 2002-01-03 68.3 4
#> 6 Roger 2009-07-03 83.8 5
#> 7 Roger 2015-06-02 94.2 6
#> 8 Roger 2019-02-10 48.8 7
#> 9 Sam 2005-04-17 16.6 8
#> 10 Sam 2010-08-12 93.2 9
#> 11 Sam 2010-09-21 65.5 10
#> 12 Stacy 2012-11-12 92.6 11
#> 13 Stacy 2014-10-10 84.4 12
#> 14 Stacy 2016-04-01 7.43 13
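In dplyr 1.0.0 and later, passing grouping columns directly to group_indices() is deprecated. A minimal sketch of the same idea for newer versions, assuming dplyr >= 1.0.0 is installed, groups first and then calls cur_group_id(). The `demo` data frame below is a small hypothetical stand-in, not the original `tst`:

```r
library(dplyr)

# Hypothetical miniature data set for illustration only
demo <- tibble(
  name = c("Bob", "Sam", "Bob", "Bob", "Sam"),
  date = as.Date(c("2009-07-03", "2010-08-12", "2003-08-21",
                   "2009-07-03", "2005-04-17"))
)

demo_ids <- demo %>%
  arrange(name, date) %>%
  group_by(name, date) %>%                      # one group per name/date pair
  mutate(check_in_number = cur_group_id()) %>%  # index of the current group
  ungroup()

demo_ids
```

Groups are numbered in sorted key order, so rows that share both name and date receive the same number, and the numbers keep increasing across names.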
If you need the numbering to restart for each name, you can rescale based on the first value within each name:
tst %>%
arrange(name, date) %>%
mutate(check_in_number = group_indices(., name, date)) %>%
group_by(name) %>%
mutate(check_in_number = check_in_number - first(check_in_number) + 1)
#> # A tibble: 14 x 4
#> # Groups: name [4]
#> name date amount check_in_number
#> <chr> <date> <dbl> <dbl>
#> 1 Bob 2003-08-21 91.1 1
#> 2 Bob 2009-07-03 38.1 2
#> 3 Bob 2009-07-03 28.3 2
#> 4 Bob 2015-12-31 22.3 3
#> 5 Roger 2002-01-03 68.3 1
#> 6 Roger 2009-07-03 83.8 2
#> 7 Roger 2015-06-02 94.2 3
#> 8 Roger 2019-02-10 48.8 4
#> 9 Sam 2005-04-17 16.6 1
#> 10 Sam 2010-08-12 93.2 2
#> 11 Sam 2010-09-21 65.5 3
#> 12 Stacy 2012-11-12 92.6 1
#> 13 Stacy 2014-10-10 84.4 2
#> 14 Stacy 2016-04-01 7.43 3
Created on 2019-06-18 by the reprex package (v0.3.0)
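A possible shortcut for the per-name numbering, offered as an alternative rather than as part of the answer above, is dplyr's dense_rank(). It ranks values without gaps and gives tied values the same rank, which matches the "same date counts as one check-in" requirement. The `demo` data frame is again a hypothetical stand-in:

```r
library(dplyr)

# Hypothetical miniature data set for illustration only
demo <- tibble(
  name = c("Bob", "Sam", "Bob", "Bob", "Sam"),
  date = as.Date(c("2009-07-03", "2010-08-12", "2003-08-21",
                   "2009-07-03", "2005-04-17"))
)

demo_ranked <- demo %>%
  group_by(name) %>%
  mutate(check_in_number = dense_rank(date)) %>%  # ties share a rank, no gaps
  ungroup() %>%
  arrange(name, date)                             # sorting is only for display

demo_ranked
```

Because the rank is computed within each name, this version does not depend on the rows being sorted beforehand.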
Answer 1 (score: 1)
With data.table:
library(data.table)
setDT(tst)[order(name, date)][, check_in_number := .GRP, .(name, date)][]
# name date amount check_in_number
# 1: Bob 2003-08-21 66.36 1
# 2: Bob 2009-07-03 22.18 2
# 3: Bob 2009-07-03 96.15 2
# 4: Bob 2015-12-31 31.64 3
# 5: Roger 2002-01-03 92.32 4
# 6: Roger 2009-07-03 41.85 5
# 7: Roger 2015-06-02 15.46 6
# 8: Roger 2019-02-10 80.38 7
# 9: Sam 2005-04-17 49.18 8
#10: Sam 2010-08-12 73.57 9
#11: Sam 2010-09-21 49.37 10
#12: Stacy 2012-11-12 24.82 11
#13: Stacy 2014-10-10 23.31 12
#14: Stacy 2016-04-01 80.12 13
If we need the numbering to restart for each name:
setDT(tst)[order(name, date)][, check_in_number := .GRP,
.(name, date)][, check_in_number := match(check_in_number,
unique(check_in_number)), .(name)][]
# name date amount check_in_number
# 1: Bob 2003-08-21 66.36 1
# 2: Bob 2009-07-03 22.18 2
# 3: Bob 2009-07-03 96.15 2
# 4: Bob 2015-12-31 31.64 3
# 5: Roger 2002-01-03 92.32 1
# 6: Roger 2009-07-03 41.85 2
# 7: Roger 2015-06-02 15.46 3
# 8: Roger 2019-02-10 80.38 4
# 9: Sam 2005-04-17 49.18 1
#10: Sam 2010-08-12 73.57 2
#11: Sam 2010-09-21 49.37 3
#12: Stacy 2012-11-12 24.82 1
#13: Stacy 2014-10-10 23.31 2
#14: Stacy 2016-04-01 80.12 3
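The per-name restart can also be sketched with data.table's rleid(), which assigns a new id each time the value changes along a vector. This is an alternative to the match()/unique() step above, not the answer's own method, and it relies on the table being sorted by name and date first. `demo_dt` is a hypothetical stand-in data set:

```r
library(data.table)

# Hypothetical miniature data set for illustration only
demo_dt <- data.table(
  name = c("Bob", "Sam", "Bob", "Bob", "Sam"),
  date = as.Date(c("2009-07-03", "2010-08-12", "2003-08-21",
                   "2009-07-03", "2005-04-17"))
)

setorder(demo_dt, name, date)  # sort in place so equal dates are adjacent
# Within each name, start a new id whenever the date changes
demo_dt[, check_in_number := rleid(date), by = name]

demo_dt[]
```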