Sum of word occurrences with multiple conditions

Asked: 2019-01-22 15:33:32

Tags: r

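For each company and year, I would like to count how many times the fruits produced in that year were also produced by the same company during the previous five years, and store these totals in a new column.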

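For example: in 2016, company_b produced apples; oranges; pears, while in the preceding five years company_b produced (2011: apples, oranges, bananas) and (2014: oranges; pears). Counting how often the fruits of the focal year (2016) were produced in the previous five years gives 4 (apples once, oranges twice, pears once).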
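The arithmetic of this example can be checked by hand in a few lines of base R. This is only a sketch of the counting rule as described above; the variable names `focal`, `previous` and `matches_per_year` are made up for illustration.

```r
# Fruits company_b produced in the focal year (2016)
focal <- c("apples", "oranges", "pears")

# Fruits company_b produced in the five years before 2016
previous <- list(
  `2011` = c("apples", "oranges", "bananas"),
  `2014` = c("oranges", "pears")
)

# For each earlier year, count how many of the focal fruits appear in it
matches_per_year <- sapply(previous, function(p) sum(focal %in% p))

# Add up the per-year counts: 2 (from 2011) + 2 (from 2014)
total <- sum(matches_per_year)
total
# [1] 4
```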

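While looking for an answer, I only found posts that sum numbers, such as "R: calculate the number of occurrences of a specific event in a specified time future". What I need, however, is to count the occurrences of all words for any given company over the previous five years.

Any help would be greatly appreciated, and solutions using dplyr are welcome too! :)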

df <- data.frame(company=c("company_a","company_b","company_b", "company_a","company_b","company_a"), 
             fruit=c("peaches, apples; oranges","apples; oranges; bananas","oranges; pears","bananas; apples; oranges; pears","apples; oranges; pears","bananas; apples; oranges; pears; peaches"),
             year=c("2010","2011","2014","2014", "2016","2018"))    

> df
    company                                    fruit year
1 company_a                 peaches, apples; oranges 2010
2 company_b                 apples; oranges; bananas 2011
3 company_b                           oranges; pears 2014
4 company_a          bananas; apples; oranges; pears 2014
5 company_b                   apples; oranges; pears 2016
6 company_a bananas; apples; oranges; pears; peaches 2018

The resulting column should look like this:

df <- cbind(df, sum_occurrences = c(0, 0, 1, 2, 4, 4))

    company                                    fruit year sum_occurrences
1 company_a                 peaches, apples; oranges 2010               0
2 company_b                 apples; oranges; bananas 2011               0
3 company_b                           oranges; pears 2014               1
4 company_a          bananas; apples; oranges; pears 2014               2
5 company_b                   apples; oranges; pears 2016               4
6 company_a bananas; apples; oranges; pears; peaches 2018               4       

1 answer:

Answer 0 (score: 1)

# clean up column classes
df[] <- lapply(df, as.character)
df$year <- as.numeric(df$year)

library(data.table)
setDT(df)

# create separate column for vector of fruits, and year + 5 column
df[, fruit2 := strsplit(gsub(' ', '', fruit), ',|;')]
df[, year2 := year + 5]

# Self join so for each row of df, this creates one row for each time another  
# row is within the year range 
df2 <- df[df, on = .(year <= year2, year > year, company = company)
          , .(company, fruit, fruit2, i.fruit2, year = x.year)]

# For each row in the (company, fruit, year) group, check whether 
# the original fruits are  in the matching rows' fruits, and store the result
# as a logical vector. Then sum the list of logical vectors (one for each row).
df3 <- df2[, .(sum_occurrences = do.call(sum, Map(`%in%`, fruit2, i.fruit2)))
           , by = .(company, fruit, year)]

# Add sum_occurrences to original df with join, and make NAs 0
df[df3, on = .(company, fruit, year), sum_occurrences := i.sum_occurrences]
df[is.na(sum_occurrences), sum_occurrences := 0]

# delete temp columns
df[, `:=`(fruit2 = NULL, year2 = NULL)]
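The counting step that builds df3 may be easier to follow in isolation. The sketch below reproduces, in plain base R, what `do.call(sum, Map(`%in%`, fruit2, i.fruit2))` sees for the company_b 2016 group; the names `current`, `earlier`, `hits` and `n` are illustrative and do not appear in the code above.

```r
# Fruits of the current row, repeated once per matching earlier row
current <- list(c("apples", "oranges", "pears"),
                c("apples", "oranges", "pears"))

# Fruits of the matching earlier rows (2011 and 2014 for company_b)
earlier <- list(c("apples", "oranges", "bananas"),
                c("oranges", "pears"))

# Map() pairs the two lists element by element; each result is a
# logical vector marking which current fruits occur in that earlier row
hits <- Map(`%in%`, current, earlier)

# do.call(sum, hits) is equivalent to sum(hits[[1]], hits[[2]]),
# i.e. the total number of TRUE values across all pairs
n <- do.call(sum, hits)
n
# [1] 4
```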

Result

df


#      company                                    fruit year sum_occurrences
# 1: company_a                 peaches, apples; oranges 2010               0
# 2: company_b                 apples; oranges; bananas 2011               0
# 3: company_b                           oranges; pears 2014               1
# 4: company_a          bananas; apples; oranges; pears 2014               2
# 5: company_b                   apples; oranges; pears 2016               4
# 6: company_a bananas; apples; oranges; pears; peaches 2018               4
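For comparison, the same counts can probably be obtained without data.table. The following is a minimal base R sketch and not part of the original answer; it assumes the same five-year window as the join above (earlier year strictly before the current one and at most five years back), and `count_previous` is a hypothetical helper name.

```r
df <- data.frame(
  company = c("company_a", "company_b", "company_b",
              "company_a", "company_b", "company_a"),
  fruit = c("peaches, apples; oranges", "apples; oranges; bananas",
            "oranges; pears", "bananas; apples; oranges; pears",
            "apples; oranges; pears",
            "bananas; apples; oranges; pears; peaches"),
  year = c(2010, 2011, 2014, 2014, 2016, 2018),
  stringsAsFactors = FALSE
)

# Remove spaces, then split each fruit string on "," or ";"
fruits <- strsplit(gsub(" ", "", df$fruit), ",|;")

# count_previous() is a hypothetical helper: for row i, count how often its
# fruits occur in rows of the same company from the previous `window` years
count_previous <- function(i, window = 5) {
  prev <- which(df$company == df$company[i] &
                df$year < df$year[i] &
                df$year >= df$year[i] - window)
  sum(vapply(prev, function(j) sum(fruits[[i]] %in% fruits[[j]]), integer(1)))
}

df$sum_occurrences <- vapply(seq_len(nrow(df)), count_previous, integer(1))
df$sum_occurrences
# [1] 0 0 1 2 4 4
```

This loops over rows and compares each one against every other row of the data, so it is likely slower than the non-equi join on large data, but it keeps the window condition explicit.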