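For each company and year, I would like to count how often the fruits produced in that year also occur in the company's records from the previous five years, and store these totals in a new column.
For example: in 2016, company_b produced apples; oranges; pears, while in the preceding five years company_b produced (2011: apples; oranges; bananas) and (2014: oranges; pears). Counting how often the 2016 fruits occur in those earlier years (apples once, oranges twice, pears once) gives 4.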
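The arithmetic for this example can be sketched in a few lines of base R. The names `focal`, `earlier` and `matches_per_year` are purely illustrative and are not part of the data below:

```r
# Fruits produced by company_b in the focal year (2016)
focal <- c("apples", "oranges", "pears")

# Fruits produced by company_b in the preceding five years
earlier <- list(
  `2011` = c("apples", "oranges", "bananas"),
  `2014` = c("oranges", "pears")
)

# For each earlier year, count how many of the focal fruits appear in it
matches_per_year <- sapply(earlier, function(x) sum(focal %in% x))

# Total number of occurrences across the five-year window
total <- sum(matches_per_year)
total
# [1] 4
```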
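While searching for an answer, I only found posts such as "R: calculate the number of occurrences of a specific event in a specified time future", which deal with sums of numbers. What I need, however, is to count the occurrences of all words for any given company over the previous five years.
Any help would be much appreciated, and solutions using dplyr are welcome too! :)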
df <- data.frame(company=c("company_a","company_b","company_b", "company_a","company_b","company_a"),
fruit=c("peaches, apples; oranges","apples; oranges; bananas","oranges; pears","bananas; apples; oranges; pears","apples; oranges; pears","bananas; apples; oranges; pears; peaches"),
year=c("2010","2011","2014","2014", "2016","2018"))
> df
    company                                    fruit year
1 company_a                 peaches, apples; oranges 2010
2 company_b                 apples; oranges; bananas 2011
3 company_b                           oranges; pears 2014
4 company_a          bananas; apples; oranges; pears 2014
5 company_b                   apples; oranges; pears 2016
6 company_a bananas; apples; oranges; pears; peaches 2018
The resulting column should look like this:
df <- cbind(df, sum_occurrences = c("0","0","1","2","4","4"))
    company                                    fruit year sum_occurrences
1 company_a                 peaches, apples; oranges 2010               0
2 company_b                 apples; oranges; bananas 2011               0
3 company_b                           oranges; pears 2014               1
4 company_a          bananas; apples; oranges; pears 2014               2
5 company_b                   apples; oranges; pears 2016               4
6 company_a bananas; apples; oranges; pears; peaches 2018               4
Answer 0 (score: 1)
# clean up column classes
df[] <- lapply(df, as.character)
df$year <- as.numeric(df$year)
library(data.table)
setDT(df)
# create separate column for vector of fruits, and year + 5 column
df[, fruit2 := strsplit(gsub(' ', '', fruit), ',|;')]
df[, year2 := year + 5]
# Self join so for each row of df, this creates one row for each time another
# row is within the year range
df2 <- df[df, on = .(year <= year2, year > year, company = company)
, .(company, fruit, fruit2, i.fruit2, year = x.year)]
# For each row in the (company, fruit, year) group, check whether
# the original fruits are in the matching rows' fruits, and store the result
# as a logical vector. Then sum the list of logical vectors (one for each row).
df3 <- df2[, .(sum_occurrences = do.call(sum, Map(`%in%`, fruit2, i.fruit2)))
, by = .(company, fruit, year)]
# Add sum_occurrences to original df with join, and make NAs 0
df[df3, on = .(company, fruit, year), sum_occurrences := i.sum_occurrences]
df[is.na(sum_occurrences), sum_occurrences := 0]
#delete temp columns
df[, `:=`(fruit2 = NULL, year2 = NULL)]
Result
df
#      company                                    fruit year sum_occurrences
# 1: company_a                 peaches, apples; oranges 2010               0
# 2: company_b                 apples; oranges; bananas 2011               0
# 3: company_b                           oranges; pears 2014               1
# 4: company_a          bananas; apples; oranges; pears 2014               2
# 5: company_b                   apples; oranges; pears 2016               4
# 6: company_a bananas; apples; oranges; pears; peaches 2018               4
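As a cross-check that does not depend on data.table, the same counts can be sketched in base R. This is only a sketch under two assumptions: the window for a row covers the earlier years from `year - 5` up to, but not including, `year` (which mirrors the join condition above), and `year` is stored as a number. `count_prior` is a hypothetical helper name, not part of any package.

```r
df <- data.frame(
  company = c("company_a", "company_b", "company_b",
              "company_a", "company_b", "company_a"),
  fruit = c("peaches, apples; oranges", "apples; oranges; bananas",
            "oranges; pears", "bananas; apples; oranges; pears",
            "apples; oranges; pears",
            "bananas; apples; oranges; pears; peaches"),
  year = c(2010, 2011, 2014, 2014, 2016, 2018),
  stringsAsFactors = FALSE
)

# Split each fruit string on "," or ";" and trim surrounding whitespace
fruits <- lapply(strsplit(df$fruit, "[,;]"), trimws)

# count_prior (hypothetical helper): for row i, count how often its fruits
# appear in rows of the same company from the previous `window` years
count_prior <- function(i, window = 5) {
  prior <- which(df$company == df$company[i] &
                 df$year < df$year[i] &
                 df$year >= df$year[i] - window)
  sum(vapply(prior, function(j) sum(fruits[[i]] %in% fruits[[j]]), numeric(1)))
}

df$sum_occurrences <- vapply(seq_len(nrow(df)), count_prior, numeric(1))
df$sum_occurrences
# [1] 0 0 1 2 4 4
```

This loops over rows, so it scales quadratically with the number of rows per company. For larger data the non-equi join above is likely the better choice.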