Question

我想计算过去五年内 animals.1 列中的单词出现在 animals.2 列中的次数：< / p>

> df = data.frame(animals.1 = c("cat; dog; bird", "dog; bird", "bird", "dog"), animals.2 = c("cat; dog; bird","dog; bird; seal", "bird", ""),year= c("2001","2005","2010","2018"), stringsAsFactors = F)
> df
       animals.1       animals.2 year
1 cat; dog; bird  cat; dog; bird 2001
2      dog; bird dog; bird; seal 2005
3           bird            bird 2010
4            dog                 2018

所需的输出

> df
       animals.1       animals.2 year count
1 cat; dog; bird  cat; dog; bird 2001     3
2      dog; bird dog; bird; seal 2005     4
3           bird            bird 2010     1
4            dog                 2018     0

修改

在第2行animal.1 = dog; bird中，出现在前5年的animal.2 = dog; bird（在2005年）和dog; bird（在2001年）中。总计数= 4

在第3行animals.1 = bird中，前五年出现在animal.2 = bird列中（2010年），而2005年不在我的五年范围内。总计数= 1

在上一个post中，我仅在没有年份条件的情况下问了类似的问题。但是，年份条件不能添加到提供的解决方案中。

任何帮助将不胜感激：）

Answer 1

使用base的{{1}}方式：

mapply()

我假设within(df, count <- mapply(function(x, y) { in5year <- paste(animals.2[year %in% (x-4):x], collapse = "; ") sum(strsplit(in5year, "; ")[[1]] %in% strsplit(y, "; ")[[1]]) }, year, animals.1) ) # animals.1 animals.2 year count # 1 cat; dog; bird cat; dog; bird 2001 3 # 2 dog; bird dog; bird; seal 2005 4 # 3 bird bird 2010 1 # 4 dog 2018 0列是数字。如果没有，请先将其转换为数字。

Answer 2

您的代码尚未设为机器可读。机器在读取“长”数据以及执行分组和联接操作方面要好得多。

当您寻找x %in% y时，您将进行很多比较。然后执行字符串操作也会减慢您的速度（拆分字符串必须找到拆分字符串的位置）。我建议您将所有数据转换为长格式，然后将其保留为长格式，直到您需要宽格式以供人类查看为止。但是我以您的格式给您输出，因为问题要求这样做。

下面的大多数代码将您的数据转换为长数据格式。我在代码中采取了额外的步骤，以尝试分解数据进入计算的方式。

library(dplyr)
library(tidyr)
library(stringr)

df = data.frame(animals.1 = c("cat; dog; bird", "dog; bird", "bird", "dog"), animals.2 = c("cat; dog; bird","dog; bird; seal", "bird", ""),year= c("2001","2005","2010","2018"), stringsAsFactors = F)

# Convert the animal.1 column to long data
animals_1_long <- df %>%
  rowwise() %>%
  mutate(
    animals_1 = str_split(animals.1,"; ")
  ) %>%
  select(animals_1,year) %>%
  unnest()
# # A tibble: 7 x 2
#   year  animals_1
#  <chr> <chr>    
# 1 2001  cat      
# 2 2001  dog      
# 3 2001  bird     
# 4 2005  dog      
# 5 2005  bird     
# 6 2010  bird     
# 7 2018  dog 

# Similarly convert the animal.2 column to long data
animals_2_long <- df %>%
  rowwise() %>%
  mutate(
    animals_2 = str_split(animals.2,"; ")
  ) %>%
  select(animals_2,year) %>%
  unnest()

# Since we want to match for the last 5 years, create a match index for year-4 to year.
animals_2_long_extend_5yrs <- animals_2_long %>%
  rename(index_year = year) %>%
  rowwise() %>%
  mutate(match_year = list(as.character((as.numeric(index_year)-4):as.numeric(index_year)))) %>%
  unnest()
# # A tibble: 40 x 3
# index_year animals_2 match_year
#    <chr>      <chr>     <chr>     
# 1  2001       cat       1997      
# 2  2001       cat       1998      
# 3  2001       cat       1999      
# 4  2001       cat       2000      
# 5  2001       cat       2001      
# 6  2001       dog       1997      
# 7  2001       dog       1998      
# 8  2001       dog       1999      
# 9  2001       dog       2000      
# 10 2001       dog       2001

在这一点上，animal_1数据的格式很长，每行一年。 animal_2数据采用长格式，每行一个动物/ match_year / index_year。这样一来，第二个数据集就可以在一次连接中覆盖过去5年的全部时间，然后将其汇总为我们最初感兴趣的年份。

将两个长数据集结合在一起，只留下年份与match_year匹配且动物名称与之匹配的行。然后，对index_year中剩余的行数求和是不重要的。

# Join the long data and the long data with the extended match index
animal_check <- animals_1_long %>%
  rename(match_year = year) %>%
  left_join(animals_2_long_extend_5yrs) %>%
  filter(animals_1 == animals_2) %>%
  # group by the index year and summarize the count
  group_by(index_year) %>%
  summarise(count = n()) %>%
  rename(year = index_year)
# # A tibble: 3 x 2
#   year  count
#   <chr> <int>
# 1 2001      3
# 2 2005      4
# 3 2010      1

至此，计算完成。剩下的就是将动物的计数加回到数据上。

# Join the yearly result back to the original dataframe
df <- df %>%
  left_join(animal_check)
df
#        animals.1       animals.2 year count
# 1 cat; dog; bird  cat; dog; bird 2001     3
# 2      dog; bird dog; bird; seal 2005     4
# 3           bird            bird 2010     1
# 4            dog                 2018    NA

更新：

# Data for benchmark:
df = data.frame(animals.1 = c("cat; dog; bird", "dog; bird", "bird", "dog"), 
                animals.2 = c("cat; dog; bird","dog; bird; seal", "bird", ""), 
                stringsAsFactors = F)

df <- replicate(10000,{df}, simplify=F) %>% do.call(rbind, .)
df$year <- as.character(seq(2000,2000 + nrow(df) - 1))
# microbenchmark results
      min       lq     mean   median       uq      max neval
 5.785196 5.950748 6.642028 6.981055 7.001854 7.491287     5

计算条件中另一个列中某个字符串出现字符串的次数

2 个答案: