I have the following data, and I want to create a new variable that takes into account information from previous periods. For example:
moviewatched<- c('Comedy', 'Horror', 'Comedy', 'Horror', 'Drama', 'Comedy', 'Drama')
name<- c('john', 'john', 'john', 'john', 'john','kate','kate')
time<- c('1-2018', '1-2018', '1-2018', '2-2018', '2-2018','1-2018' ,'2-2018')
df<- data.frame(moviewatched, name, time)
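Now I need to create a variable that tells how many new genres a person watched in a given month. In the data above, John watched 2 genres in the first month of 2018 and 1 new genre in the second month (he had already watched Comedy and Horror in the first month). Is there a way to create a running count of the new genres a person starts watching? I want to create a variable named movietypewatched that holds the total number of distinct genres the person has watched up to and including that month. The expected output is as follows: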
name time movietypewatched
john 1-2018 2
john 2-2018 3
kate 1-2018 1
kate 2-2018 2
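Thanks.
Answer 0 (score: 3)
A solution using dplyr. We can remove duplicated rows based on moviewatched and name, count the unique moviewatched values for each name and time, and then use cumsum to compute the running total within each name. df2 is the final output.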
library(dplyr)
df2 <- df %>%
  distinct(moviewatched, name, .keep_all = TRUE) %>%
  group_by(name, time) %>%
  summarise(movietypewatched = n_distinct(moviewatched)) %>%
  mutate(movietypewatched = cumsum(movietypewatched)) %>%
  ungroup()
df2
# # A tibble: 4 x 3
# name time movietypewatched
# <fct> <fct> <int>
# 1 john 1-2018 2
# 2 john 2-2018 3
# 3 kate 1-2018 1
# 4 kate 2-2018 2
Here is a data.table solution that follows the same logic.
library(data.table)
setDT(df)
df2 <- df[!duplicated(df[, .(moviewatched, name)])][
, .(movietypewatched = uniqueN(moviewatched)), by = .(name, time)][
, movietypewatched := cumsum(movietypewatched), by = name]
df2[]
# name time movietypewatched
# 1: john 1-2018 2
# 2: john 2-2018 3
# 3: kate 1-2018 1
# 4: kate 2-2018 2
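For comparison, the same three steps (drop repeat genres, count per month, running total) can be sketched in base R. This is only an illustration of the logic above, not part of the original answer; the names first_seen and new_types are made up here, and the sketch assumes the rows are already in chronological order within each name.

```r
# Sample data from the question
df <- data.frame(
  moviewatched = c("Comedy", "Horror", "Comedy", "Horror", "Drama", "Comedy", "Drama"),
  name = c("john", "john", "john", "john", "john", "kate", "kate"),
  time = c("1-2018", "1-2018", "1-2018", "2-2018", "2-2018", "1-2018", "2-2018"),
  stringsAsFactors = FALSE
)

# Keep only the first row in which each person watches each genre
# (assumption: rows are already in chronological order within each name)
first_seen <- df[!duplicated(df[, c("name", "moviewatched")]), ]

# Count the new genres per name and month
new_types <- aggregate(moviewatched ~ name + time, data = first_seen, FUN = length)

# Sort by name, then month label (text order is enough for this sample only)
new_types <- new_types[order(new_types$name, new_types$time), ]

# Running total of new genres within each name
new_types$movietypewatched <- ave(new_types$moviewatched, new_types$name, FUN = cumsum)
new_types$moviewatched <- NULL
new_types
```

As with the dplyr and data.table versions, a month in which a person watches no new genre does not appear in the result.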
Answer 1 (score: 3)
First convert the time data to a date class to establish order, e.g. with lubridate::myd with truncated = 1. From there, arrange the rows so they are in order, then, grouped by name, use purrr::accumulate to build a list of the unique values seen so far in moviewatched; calling lengths on that list returns the number of distinct genres seen up to each row. Aggregate by month with max to get the cumulative number of genres for each month.
library(tidyverse)
# tibble() replaces the deprecated data_frame()
df <- tibble(
  moviewatched = c('Comedy', 'Horror', 'Comedy', 'Horror', 'Drama', 'Comedy', 'Drama'),
  name = c('john', 'john', 'john', 'john', 'john','kate','kate'),
  time = lubridate::myd(c('1-2018', '1-2018', '1-2018', '2-2018', '2-2018','1-2018' ,'2-2018'), truncated = 1)
)
df %>%
  group_by(name) %>%
  arrange(name, time) %>%
  mutate(n_types = lengths(accumulate(moviewatched, ~unique(c(...))))) %>%
  group_by(name, time) %>%
  summarise(n_types = max(n_types))
#> # A tibble: 4 x 3
#> # Groups: name [2]
#> name time n_types
#> <chr> <date> <dbl>
#> 1 john 2018-01-01 2
#> 2 john 2018-02-01 3
#> 3 kate 2018-01-01 1
#> 4 kate 2018-02-01 2
Answer 2 (score: 3)
Make a table of first dates watched; count by month; and take the cumulative sum:
library(data.table)
setDT(df)
# fix bad date
df[, d := as.IDate(paste(time, "01", sep="-"), "%m-%Y-%d")]
# identify month first watched
fw = df[, .(d = min(d)), by=.(name, moviewatched)]
# count new movies per month
nm = fw[, .N, keyby=.(name, d)]
# take cumulative count
nm[, cN := cumsum(N), by=name]
name d N cN
1: john 2018-01-01 2 2
2: john 2018-02-01 1 3
3: kate 2018-01-01 1 1
4: kate 2018-02-01 1 2
You need to convert the date; otherwise the min() will be incorrect and/or broken.
There are two aggregation steps here, but the code should be fast thanks to optimization in data.table (see ?GForce).
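A small sketch of why the conversion matters: month-year labels stored as text are compared character by character, so min() does not pick the earliest month once a two-digit month is involved. The labels below are made up for illustration and are not from the question's data.

```r
# Hypothetical month-year labels stored as text, including a two-digit month
time <- c("2-2018", "10-2018", "1-2019")

# Text comparison: the "smallest" label is not the earliest month
min(time)

# Append a day and parse to Date, then compare chronologically
d <- as.Date(paste(time, "01", sep = "-"), format = "%m-%Y-%d")
min(d)
```

If time is a factor rather than a character vector, min() raises an error instead of returning a wrong value, which is the "broken" case mentioned above.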
Answer 3 (score: 2)
Using data.table:
library(data.table)
df <- unique(df)
setDT(df)[, movietypewatched := 1:.N, by = c("moviewatched", "name")]
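# keep only the first month in which each person watches a genre
# (== 1 also drops third and later repeats, which != 2 would keep)
df <- df[movietypewatched == 1, ]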
df[, movietypewatched := .N, by = c("name", "time")][, moviewatched := NULL]
df <- unique(df)
df[, movietypewatched := cumsum(movietypewatched), by = name]
name time movietypewatched
1: john 1-2018 2
2: john 2-2018 3
3: kate 1-2018 1
4: kate 2-2018 2
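Answer 4 (score: 0)
Here I take intermediate steps, in case you want the unique values in genre_all and the count in genre_count.
Note that:
You need to arrange the data frame by name and date (the time column here) so that the values accumulate in order.
lag() is used to get the previous value. Since the first entry for each name has no previous value, it gets NA.
When counting unique genres with n_distinct(), you need to remove the NA.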
library(dplyr)
library(purrr)
library(tidyr)
moviewatched <- c('Comedy', 'Horror', 'Comedy', 'Horror', 'Drama', 'Comedy', 'Drama')
# name and time as in the question, so that the result shown below is reproduced
name <- c('john', 'john', 'john', 'john', 'john','kate','kate')
time <- c('1-2018', '1-2018', '1-2018', '2-2018', '2-2018','1-2018' ,'2-2018')
df <- data.frame(moviewatched, name, time)
df_final <- df %>%
  arrange(name, time) %>%
  group_by(name, time) %>%
  nest(.key= 'genre') %>%
  group_by(name) %>%
  mutate(genre_all = map2(genre, lag(genre), rbind) %>% map(unique)) %>%
  ungroup() %>%
  mutate(genre_count = map_int(genre_all, ~ lift(n_distinct)(.x, na.rm =TRUE)))
Result:
> df_final
# A tibble: 4 x 5
name time genre genre_all genre_count
<fct> <fct> <list> <list> <int>
1 john 1-2018 <tibble [3 x 1]> <tibble [3 x 1]> 2
2 john 2-2018 <tibble [2 x 1]> <tibble [3 x 1]> 3
3 kate 1-2018 <tibble [1 x 1]> <tibble [2 x 1]> 1
4 kate 2-2018 <tibble [1 x 1]> <tibble [2 x 1]> 2
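Combining each month with lag() looks only one period back, which is enough for the two-month sample. With more months, the genres need to be carried forward across all earlier periods. A minimal base R sketch of that idea follows; the data and the names genres_by_month, seen_so_far and lag_only_count are hypothetical and not part of the answer above.

```r
# Hypothetical data: one person, three months, a different genre each month
genres_by_month <- list(
  "2018-01" = c("Comedy"),
  "2018-02" = c("Horror"),
  "2018-03" = c("Drama")
)

# Carry the union of all genres seen so far forward through the months;
# the empty starting value is dropped again with [-1]
seen_so_far <- Reduce(union, genres_by_month, character(0), accumulate = TRUE)[-1]
genre_count <- lengths(seen_so_far)

# For contrast: combining only the current and the previous month
prev <- c(list(character(0)), head(genres_by_month, -1))
lag_only_count <- lengths(Map(union, genres_by_month, prev))
```

Here genre_count grows with every new genre, while lag_only_count forgets Comedy by the third month.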