I have a dataset of game session data (id, number of sessions, average session duration, and session date for each id). Here is a sample of mydat:
mydat=read.csv("C:/Users/Admin/desktop/rty.csv", sep=";",dec=",")
mydat
structure(list(udid = c(74385162L, 79599601L, 79599601L, 91475825L,
91475825L, 91492531L, 92137561L, 96308016L, 96308016L, 96308016L,
96308016L, 96308016L, 96495076L, 97135620L, 97135620L, 97135620L,
97135620L, 97135620L, 97135620L, 97135620L, 97135620L, 97135620L,
97135620L, 97165942L), count = c(1L, 1L, 1L, 1L, 3L, 1L, 1L,
2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), avg_duration = c(39L, 1216L, 568L, 5L, 6L, 79L, 9L, 426L,
78L, 884L, 785L, 785L, 22L, 302L, 738L, 280L, 2782L, 5L, 2284L,
144L, 234L, 231L, 539L, 450L), date = structure(c(13L, 3L, 3L,
1L, 1L, 14L, 2L, 11L, 11L, 11L, 12L, 12L, 9L, 7L, 4L, 4L, 5L,
6L, 8L, 8L, 8L, 8L, 8L, 10L), .Label = c("11.10.16", "12.12.16",
"15.11.16", "15.12.16", "16.12.16", "17.12.16", "18.10.16", "18.12.16",
"21.10.16", "26.10.16", "28.11.16", "29.11.16", "31.10.16", "8.10.16"
), class = "factor")), .Names = c("udid", "count", "avg_duration",
"date"), class = "data.frame", row.names = c(NA, -24L))
For each id, I need to mark the last date on which the player was seen with 1, and mark every other date on which that id was seen with 0.
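For example, id 96308016 has 5 observations, so we mark the last (fifth) observation with 1 and the preceding 4 observations with 0.
If an id has only 1 observation, we mark it with 1, as for id 74385162.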
To make it clearer, this is my expected output:
udid count avg_duration date diff
74385162 1 39 31.10.16 1
79599601 1 1216 15.11.16 0
79599601 1 568 15.11.16 1
91475825 1 5 11.10.16 0
91475825 3 6 11.10.16 1
91492531 1 79 8.10.16 1
92137561 1 9 12.12.16 1
96308016 2 426 28.11.16 0
96308016 2 78 28.11.16 0
96308016 1 884 28.11.16 0
96308016 1 785 29.11.16 0
96308016 1 785 29.11.16 1
96495076 1 22 21.10.16 1
97135620 2 302 18.10.16 0
97135620 1 738 15.12.16 0
97135620 1 280 15.12.16 0
97135620 1 2782 16.12.16 0
97135620 1 5 17.12.16 0
97135620 1 2284 18.12.16 0
97135620 1 144 18.12.16 0
97135620 1 234 18.12.16 0
97135620 1 231 18.12.16 0
97135620 1 539 18.12.16 1
97165942 1 450 26.10.16 1
How can I do this?
Answer 0: (score: 3)
You can do the following:
library(dplyr)
mydat = mydat %>%
group_by(udid) %>%
mutate(diff=ifelse(row_number()==n(),1,0)) %>%
as.data.frame()
Output:
udid count avg_duration date diff
1 74385162 1 39 31.10.16 1
2 79599601 1 1216 15.11.16 0
3 79599601 1 568 15.11.16 1
4 91475825 1 5 11.10.16 0
5 91475825 3 6 11.10.16 1
6 91492531 1 79 8.10.16 1
7 92137561 1 9 12.12.16 1
8 96308016 2 426 28.11.16 0
9 96308016 2 78 28.11.16 0
10 96308016 1 884 28.11.16 0
11 96308016 1 785 29.11.16 0
12 96308016 1 785 29.11.16 1
13 96495076 1 22 21.10.16 1
14 97135620 2 302 18.10.16 0
15 97135620 1 738 15.12.16 0
16 97135620 1 280 15.12.16 0
17 97135620 1 2782 16.12.16 0
18 97135620 1 5 17.12.16 0
19 97135620 1 2284 18.12.16 0
20 97135620 1 144 18.12.16 0
21 97135620 1 234 18.12.16 0
22 97135620 1 231 18.12.16 0
23 97135620 1 539 18.12.16 1
24 97165942 1 450 26.10.16 1
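For readers who would rather not load dplyr, roughly the same "flag the last row of each group" idea can be sketched in base R with ave(). The `toy` data frame below is a made-up stand-in for mydat, not the real data, and like the dplyr version this relies on the rows already being in the desired order within each udid.

```r
# Hypothetical toy data mirroring the structure of mydat
toy <- data.frame(
  udid = c(1L, 2L, 2L, 3L, 3L, 3L),
  avg_duration = c(39L, 1216L, 568L, 5L, 6L, 79L)
)

# ave() applies FUN within each udid group and returns the results
# in the original row order; the highest row index in a group is its last row
toy$diff <- ave(seq_along(toy$udid), toy$udid,
                FUN = function(i) as.integer(i == max(i)))

print(toy)
```

Each udid ends up with exactly one 1, on its last row.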
Answer 1: (score: 2)
If the data are already sorted by date, this will work:
mydat$diff = as.integer(!duplicated(mydat$udid, fromLast = TRUE))
head(mydat)
# udid count avg_duration date diff
# 1 74385162 1 39 31.10.16 1
# 2 79599601 1 1216 15.11.16 0
# 3 79599601 1 568 15.11.16 1
# 4 91475825 1 5 11.10.16 0
# 5 91475825 3 6 11.10.16 1
# 6 91492531 1 79 8.10.16 1
If the data are not already sorted by date, convert the column to the Date class, sort, and then do the above:
mydat$date = as.Date(mydat$date, format = "%d.%m.%y")  # %m is month; %M would mean minutes
mydat = mydat[order(mydat$udid, mydat$date), ]
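Putting the steps together, a minimal end-to-end sketch might look like the following. The `toy` data frame is invented for illustration and is deliberately out of date order. Note that the month specifier in the format string is lowercase `%m`, since uppercase `%M` stands for minutes.

```r
# Hypothetical unsorted toy data in the same dd.mm.yy format as mydat
toy <- data.frame(
  udid = c(2L, 1L, 2L, 2L),
  date = c("29.11.16", "31.10.16", "28.11.16", "8.10.16"),
  stringsAsFactors = FALSE
)

# 1. Parse the strings as Date (day.month.two-digit year)
toy$date <- as.Date(toy$date, format = "%d.%m.%y")

# 2. Sort by id, then chronologically within each id
toy <- toy[order(toy$udid, toy$date), ]

# 3. Scanning from the end, the first occurrence of each udid is its last row
toy$diff <- as.integer(!duplicated(toy$udid, fromLast = TRUE))

print(toy)
```

After sorting, the row for udid 2 dated 2016-11-29 is the one flagged with 1.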
Answer 2: (score: 1)
If you don't want to sort by date, the logical approach would be:
library(dplyr)
mydat$date = as.Date(mydat$date, "%d.%m.%y")  # %m is month; %M would mean minutes
mydat %>%
group_by(udid) %>%
mutate(diff = ifelse(date == max(date), 1L, 0L)) #Last date
udid count avg_duration date diff
<int> <int> <int> <date> <int>
1 74385162 1 39 2016-10-31 1
2 79599601 1 1216 2016-11-15 1
3 79599601 1 568 2016-11-15 1
4 91475825 1 5 2016-10-11 1
5 91475825 3 6 2016-10-11 1
6 91492531 1 79 2016-10-08 1
7 92137561 1 9 2016-12-12 1
8 96308016 2 426 2016-11-28 0
9 96308016 2 78 2016-11-28 0
10 96308016 1 884 2016-11-28 0
# ... with 14 more rows
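However, your sample data appear to contain duplicate dates within an id, so the logic above flags every row that shares the latest date rather than a single row. The solution should still work on the real data, especially if `date` is stored as a date/time.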
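If duplicate dates do occur and only one row per id should be flagged, one possible tie-break is to pick the last row among those sharing the latest date. This is a sketch in base R on made-up data (`toy` is hypothetical) and is not part of the original answer:

```r
# Hypothetical toy data where udid 1 has two rows on its latest date
toy <- data.frame(
  udid = c(1L, 1L, 1L, 2L),
  date = as.Date(c("2016-11-28", "2016-11-29", "2016-11-29", "2016-10-31"))
)

# Within each udid: which(d == max(d)) lists the positions holding the latest
# date, and max() of those positions picks the last such row
toy$diff <- ave(as.integer(toy$date), toy$udid,
                FUN = function(d) as.integer(seq_along(d) == max(which(d == max(d)))))

print(toy)
```

Here only the third row of udid 1 gets a 1, even though the second row has the same date.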