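Create a new variable that takes prior information from previous records into account

Time: 2018-09-18 18:47:08

Tags: r dplyr data.table tidyr

I have the following data, and I want to create a new variable that takes into account information from earlier periods. For example,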

moviewatched<- c('Comedy', 'Horror', 'Comedy', 'Horror', 'Drama', 'Comedy', 'Drama')
name<- c('john', 'john', 'john', 'john', 'john','kate','kate')
time<- c('1-2018', '1-2018', '1-2018', '2-2018', '2-2018','1-2018' ,'2-2018')


df<- data.frame(moviewatched, name, time)

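Now I need to create a variable that tells how many new genres he/she watched in a given month. For example, in the case above, John watched 2 genres in the first month of 2018 and 1 new genre in the second month (because he had already watched Comedy and Horror in the first month). Is there a way to create a running count of the new genres a person starts watching? I want to create a variable called movietypewatched that holds the total number of distinct genres the person has watched up to and including that month. The expected output is as follows: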

     name time   movietypewatched 
     john 1-2018       2
     john 2-2018       3
     kate 1-2018       1
     kate 2-2018       2

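Thanks

5 answers:

Answer 0 (score: 3)

A solution using dplyr. We can drop duplicated rows based on moviewatched and name, count the unique moviewatched values per name and time, and then use cumsum to compute the running total. Because distinct keeps the first occurrence of each combination, this relies on the rows already being in chronological order within each name. df2 is the final output.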

library(dplyr)

df2 <- df %>%
  distinct(moviewatched, name, .keep_all = TRUE) %>%
  group_by(name, time) %>%
  summarise(movietypewatched = n_distinct(moviewatched)) %>%
  mutate(movietypewatched = cumsum(movietypewatched)) %>%
  ungroup()
df2
# # A tibble: 4 x 3
#   name  time   movietypewatched
#   <fct> <fct>             <int>
# 1 john  1-2018                2
# 2 john  2-2018                3
# 3 kate  1-2018                1
# 4 kate  2-2018                2

Here is a data.table solution that follows the same logic.

library(data.table)

setDT(df)
df2 <- df[!duplicated(df[, .(moviewatched, name)])][
  , .(movietypewatched = uniqueN(moviewatched)), by = .(name, time)][
    , movietypewatched := cumsum(movietypewatched), by = name]
df2[]
#    name   time movietypewatched
# 1: john 1-2018                2
# 2: john 2-2018                3
# 3: kate 1-2018                1
# 4: kate 2-2018                2
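
For comparison, the same drop-duplicates-then-cumsum idea can be sketched in base R. This is only an illustration, assuming the rows are already in chronological order within each name; first_seen, counts and new_genres are names introduced here and are not part of the answer above.

```r
# Sample data from the question (kept as strings, not factors)
df <- data.frame(
  moviewatched = c("Comedy", "Horror", "Comedy", "Horror", "Drama", "Comedy", "Drama"),
  name = c("john", "john", "john", "john", "john", "kate", "kate"),
  time = c("1-2018", "1-2018", "1-2018", "2-2018", "2-2018", "1-2018", "2-2018"),
  stringsAsFactors = FALSE
)

# Keep only the first row in which each person watches each genre
# (assumption: rows are already in chronological order within each name)
first_seen <- df[!duplicated(df[c("name", "moviewatched")]), ]

# Count the new genres per person and month
counts <- aggregate(moviewatched ~ name + time, data = first_seen, FUN = length)
names(counts)[3] <- "new_genres"

# Sort by person and month; sorting the month strings is only safe for this
# small sample, real data should be converted to a date class first
counts <- counts[order(counts$name, counts$time), ]

# Running total of new genres within each person
counts$movietypewatched <- ave(counts$new_genres, counts$name, FUN = cumsum)
counts
```

The movietypewatched column should then hold the same running totals as in the expected output of the question.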

Answer 1 (score: 3)

First convert the time column to a date class to establish an order, e.g. with lubridate::myd with truncated = 1. Then arrange the rows so they are in chronological order and, grouped by name, use purrr::accumulate to build a list of the unique values seen so far in moviewatched; calling lengths on that list returns the number of distinct genres seen up to each row. Finally, aggregate by month with max to get the cumulative number of genres for each month.

library(tidyverse)

df <- tibble(  # data_frame() is deprecated in favour of tibble()
    moviewatched =  c('Comedy', 'Horror', 'Comedy', 'Horror', 'Drama', 'Comedy', 'Drama'),
    name =  c('john', 'john', 'john', 'john', 'john','kate','kate'),
    time =  lubridate::myd(c('1-2018', '1-2018', '1-2018', '2-2018', '2-2018','1-2018' ,'2-2018'), truncated = 1)
)

df %>% 
    group_by(name) %>% 
    arrange(name, time) %>%
    mutate(n_types = lengths(accumulate(moviewatched, ~unique(c(...))))) %>% 
    group_by(name, time) %>% 
    summarise(n_types = max(n_types))
#> # A tibble: 4 x 3
#> # Groups:   name [2]
#>   name  time       n_types
#>   <chr> <date>       <dbl>
#> 1 john  2018-01-01       2
#> 2 john  2018-02-01       3
#> 3 kate  2018-01-01       1
#> 4 kate  2018-02-01       2
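
To see what the accumulate step produces, here is a minimal sketch of the same idea using base R's Reduce with accumulate = TRUE instead of purrr::accumulate. The vector holds john's rows from the question; seen_so_far and running_count are illustrative names, not part of the answer above.

```r
# Genres in the order john watched them (sample data from the question)
genres <- c("Comedy", "Horror", "Comedy", "Horror", "Drama")

# Reduce(..., accumulate = TRUE) keeps every intermediate result:
# element i holds the distinct genres seen in the first i rows
seen_so_far <- Reduce(function(seen, g) unique(c(seen, g)), genres, accumulate = TRUE)

# lengths() turns each accumulated set into a running count of distinct genres
running_count <- lengths(seen_so_far)
running_count
```

Taking the maximum of running_count within each month then gives the cumulative total for that month, which is what the final summarise step does.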

Answer 2 (score: 3)

Make a table of first dates watched; count by month; and take the cumulative sum:

library(data.table)
setDT(df)

# fix bad date
df[, d := as.IDate(paste(time, "01", sep="-"), "%m-%Y-%d")]

# identify month first watched
fw = df[, .(d = min(d)), by=.(name, moviewatched)]

# count new movies per month
nm = fw[, .N, keyby=.(name, d)]

# take cumulative count
nm[, cN := cumsum(N), by=name]

   name          d N cN
1: john 2018-01-01 2  2
2: john 2018-02-01 1  3
3: kate 2018-01-01 1  1
4: kate 2018-02-01 1  2

You need to convert the date; otherwise min() will either compare the values as text (if time is a character column) or fail with an error (if time is an unordered factor).
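
The point about converting the date can be illustrated with a small base R sketch. The two month values below are hypothetical and chosen only to include a two-digit month; they are not from the original data.

```r
# Month-year strings in the question's format, with a two-digit month added
# (hypothetical values for illustration)
time <- c("2-2018", "11-2018")

# Character comparison is alphabetical, so "11-2018" sorts before "2-2018"
min(time)

# Parse to Date first (day 01 is prepended as a placeholder), then compare
d <- as.Date(paste0("01-", time), format = "%d-%m-%Y")
min(d)
```

On the string vector, min() picks November even though February is the earlier month; on the Date vector it picks February.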

There are two aggregation steps here, but the code should be fast thanks to optimization in data.table (see ?GForce).

Answer 3 (score: 2)

Using data.table:

library(data.table)
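df <- unique(df)  # drop repeated (moviewatched, name, time) rows
# number each person's viewings of a genre in row order (assumes rows are chronological)
setDT(df)[, movietypewatched := 1:.N, by = c("moviewatched", "name")]
# keep only the first viewing; dropping just the rows equal to 2 would miss a third or later month
df <- df[movietypewatched == 1, ]
# count new genres per person and month, then drop the genre column
df[, movietypewatched := .N, by = c("name", "time")][, moviewatched := NULL]
df <- unique(df)
# running total of new genres per person
df[, movietypewatched := cumsum(movietypewatched), by = name]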

   name   time movietypewatched
1: john 1-2018                2
2: john 2-2018                3
3: kate 1-2018                1
4: kate 2-2018                2

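Answer 4 (score: 0)

Here is an approach with intermediate steps, in case you want the unique values in genre_all and the count in genre_count.

Please note:

  • You need to arrange the data frame by name, time to accumulate the values.
  • You can use lag() to get the previous value. Since the first entry for each name has no previous value, it gets NA.
  • When counting unique genres with n_distinct(), you will need to drop the NAs.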
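
The behaviour of lag() and n_distinct() described in the notes above can be checked on small vectors. This is a minimal sketch assuming dplyr is installed; the vectors are made up for illustration.

```r
library(dplyr)

# lag() shifts values down by one position; the first element has no
# predecessor, so it becomes NA
months <- c("1-2018", "2-2018", "3-2018")
prev <- lag(months)
prev

# n_distinct() counts NA as a value of its own unless na.rm = TRUE
genres <- c("Comedy", NA, "Horror", "Comedy")
n_distinct(genres)
n_distinct(genres, na.rm = TRUE)
```

Without na.rm = TRUE the NA introduced by lag() would add one to the genre count for each person's first month.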


library(dplyr)
library(purrr)
library(tidyr)

moviewatched <- c('Comedy', 'Horror', 'Comedy', 'Horror', 'Drama', 'Comedy', 'Drama')
# same sample data as in the question, so that it matches the result shown below
name <- c('john', 'john', 'john', 'john', 'john', 'kate', 'kate')
time <- c('1-2018', '1-2018', '1-2018', '2-2018', '2-2018', '1-2018', '2-2018')

df <- data.frame(moviewatched, name, time)


df_final <- df %>% 
  arrange(name, time) %>% 
  group_by(name, time) %>%
  nest(.key= 'genre') %>% 
  group_by(name) %>% 
  mutate(genre_all = map2(genre, lag(genre), rbind) %>% map(unique)) %>% 
  ungroup() %>% 
  mutate(genre_count = map_int(genre_all, ~ lift(n_distinct)(.x, na.rm =TRUE)))

Result:

> df_final
# A tibble: 4 x 5
  name  time   genre            genre_all        genre_count
  <fct> <fct>  <list>           <list>                 <int>
1 john  1-2018 <tibble [3 x 1]> <tibble [3 x 1]>           2
2 john  2-2018 <tibble [2 x 1]> <tibble [3 x 1]>           3
3 kate  1-2018 <tibble [1 x 1]> <tibble [2 x 1]>           1
4 kate  2-2018 <tibble [1 x 1]> <tibble [2 x 1]>           2