I have a data frame that looks like this toy data frame:
df <- data.frame(company=c("company_a","company_b","company_b", "company_a","company_b","company_a"),
fruit=c("peaches, apples; oranges","apples; oranges; bananas","oranges; pears","bananas; apples; oranges; pears","apples; oranges; pears","bananas; apples; oranges; pears; peaches"),
year=c("2010","2011","2014","2014", "2016","2018"))
> df
company fruit year
1 company_a peaches; apples; oranges 2010
2 company_b apples; oranges; bananas 2011
3 company_b oranges; pears 2014
4 company_a bananas; apples; oranges; pears 2014
5 company_b apples; oranges; pears 2016
6 company_a bananas; apples; oranges; pears; peaches 2018
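Desired result
I want a column (new_occurrences) that counts, for each row, the fruits that did not appear for that company in the previous five years.
For example, row 4: for company_a, bananas and pears did not appear in the previous 5 years, so new_occurrences = 2.
It would look like this: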
> df
company fruit year new_occurrences
1 company_a peaches; apples; oranges 2010 3
2 company_b apples; oranges; bananas 2011 3
3 company_b oranges; pears 2014 1
4 company_a bananas; apples; oranges; pears 2014 2
5 company_b apples; oranges; pears 2016 0
6 company_a bananas; apples; oranges; pears; peaches 2018 1
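As a hand check of the row 4 value, here is a minimal sketch using base R's setdiff. The vectors current and previous are typed in by hand for illustration and are not part of the original code. The window is assumed to be the five years before 2014 (2009 to 2013), which for company_a contains only the 2010 row.

```r
# Fruits listed by company_a in 2014 (row 4)
current <- c("bananas", "apples", "oranges", "pears")

# Fruits listed by company_a in the assumed window 2009-2013 (row 1 only)
previous <- c("peaches", "apples", "oranges")

# Fruits in the current row that were not seen in the window
new_fruits <- setdiff(current, previous)
new_fruits          # "bananas" "pears"
length(new_fruits)  # 2
```

The same comparison, repeated per row against the union of that company's fruits in the window, gives the new_occurrences column.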
Attempt
I tried adapting the answer to this question; for that I created a function that is the opposite of '%in%' and used it in df3.
'%!in%' <- function(x,y)!('%in%'(x,y))
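As a quick illustration of what this negated operator returns, here is a small sketch. The two fruit vectors are made up for the example.

```r
# Negated membership test: TRUE where an element of x is NOT found in y
`%!in%` <- function(x, y) !(x %in% y)

x <- c("apples", "pears")
y <- c("apples", "oranges")

res <- x %!in% y
res       # FALSE TRUE
sum(res)  # 1, i.e. one element of x is missing from y
```

Summing the logical vector counts the elements of x that are absent from y, which is the building block the attempt relies on.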
# clean up column classes
df[] <- lapply(df, as.character)
df$year <- as.numeric(df$year)
library(data.table)
setDT(df)
# create separate column for vector of fruits, and year + 5 column
df[, fruit2 := strsplit(gsub(' ', '', fruit), ',|;')]
df[, year2 := year + 5]
# Self join so for each row of df, this creates one row for each time another
# row is within the year range
df2 <- df[df, on = .(year <= year2, year > year, company = company)
, .(company, fruit, fruit2, i.fruit2, year = x.year)]
# create a function which is the opposite of '%in%'
'%!in%' <- function(x,y)!('%in%'(x,y))
# For each row in the (company, fruit, year) group, check whether
# the original fruits are in the matching rows' fruits, and store the result
# as a logical vector. Then sum the list of logical vectors (one for each row).
df3 <- df2[, .(new_occurrences = do.call(sum, Map(`%!in%`, fruit2, i.fruit2)))
, by = .(company, fruit, year)]
# Add sum_occurrences to original df with join, and make NAs 0
df[df3, on = .(company, fruit, year), new_occurrences := i.new_occurrences]
df[is.na(new_occurrences), new_occurrences := 0]
#delete temp columns
df[, `:=`(fruit2 = NULL, year2 = NULL)]
Unfortunately, this attempt does not give me the result I want.
Any help would be greatly appreciated, and dplyr solutions are welcome too! :)
Answer 0 (score: 1)
A tidyverse attempt:
library(tidyverse)
years_window <- 5
df %>%
separate_rows(fruit, sep = "; |, ") %>%
mutate(tmp = 1,
year = as.integer(as.character(year))) %>%
complete(company = unique(.$company),
year = (min(year) - years_window):max(year),
fruit = unique(.$fruit)) %>%
arrange(year) %>%
group_by(company, fruit) %>%
mutate(check = zoo::rollapply(tmp,
FUN = function(x) sum(is.na(x)),
width = list(-(1:years_window)),
align = 'right',
fill = NA,
partial = TRUE)) %>%
group_by(company, year) %>%
mutate(new_occurrences = sum(check == years_window & !is.na(tmp))) %>%
filter(!is.na(tmp)) %>%
distinct(company, year, new_occurrences) %>%
arrange(year) %>%
left_join(df %>%
mutate(year = as.integer(as.character(year))),
by = c("company", "year")) %>%
select(company, fruit, year, new_occurrences)
Output:
# A tibble: 6 x 4
# Groups: company, year [6]
company fruit year new_occurrences
<fct> <fct> <int> <int>
1 company_a peaches, apples; oranges 2010 3
2 company_b apples; oranges; bananas 2011 3
3 company_a bananas; apples; oranges; pears 2014 2
4 company_b oranges; pears 2014 1
5 company_b apples; oranges; pears 2016 0
6 company_a bananas; apples; oranges; pears; peaches 2018 1
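Answer 1 (score: 1)
Assuming the input shown reproducibly in the Note at the end, define two functions that convert a semicolon-separated string to a character vector and back again. Then, for each row, determine the fruits the current company had in the previous 5 years and compute the set difference between the current row's fruits and those earlier fruits. A second transform computes the number of new fruits. No packages are used.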
char2vec <- function(x) scan(text = x, what = "", sep = ";", strip.white = TRUE,
quiet = TRUE)
vec2char <- function(x) paste(x, collapse = "; ")
df2 <- transform(df, new = sapply(1:nrow(df), function(i) {
year0 <- df$year[i]; company0 <- df$company[i]; fruit0 <- df$fruit[i]
prev_fruit <- char2vec(subset(df,
year < year0 & year >= year0 - 5 & company == company0)$fruit)
vec2char(Filter(function(x) !x %in% prev_fruit, char2vec(fruit0)))
}), stringsAsFactors = FALSE)
transform(df2, num_new = lengths(lapply(new, char2vec)))
giving:
company fruit year new num_new
1 company_a peaches; apples; oranges 2010 peaches; apples; oranges 3
2 company_b apples; oranges; bananas 2011 apples; oranges; bananas 3
3 company_b oranges; pears 2014 pears 1
4 company_a bananas; apples; oranges; pears 2014 bananas; pears 2
5 company_b apples; oranges; pears 2016 0
6 company_a bananas; apples; oranges; pears; peaches 2018 peaches 1
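To see what the two helpers do in isolation, here is a short sketch that repeats their definitions from above and runs them on one sample string. The sample string and the variable names v, s and empty are chosen for illustration only.

```r
# Split a semicolon-separated string into a trimmed character vector
char2vec <- function(x) scan(text = x, what = "", sep = ";",
                             strip.white = TRUE, quiet = TRUE)
# Join a character vector back into a "; "-separated string
vec2char <- function(x) paste(x, collapse = "; ")

v <- char2vec("bananas; apples; oranges; pears")
v                     # "bananas" "apples" "oranges" "pears"

s <- vec2char(v)
s                     # "bananas; apples; oranges; pears"

# An empty string yields a zero-length vector, which is why a row
# with no new fruits ends up with num_new = 0
empty <- char2vec("")
length(empty)         # 0
```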
Note
The input below is taken from the question, with one comma changed to a semicolon.
df <- data.frame(company=c("company_a","company_b","company_b",
"company_a","company_b","company_a"),
fruit=c("peaches; apples; oranges","apples; oranges; bananas",
"oranges; pears", "bananas; apples; oranges; pears",
"apples; oranges; pears", "bananas; apples; oranges; pears; peaches"),
year = c("2010","2011","2014","2014", "2016","2018"))
df[] <- lapply(df, as.character)
df$year <- as.numeric(df$year)