我想计算过去五年内 animals.1 列中的单词出现在 animals.2 列中的次数:< / p>
> df = data.frame(animals.1 = c("cat; dog; bird", "dog; bird", "bird", "dog"), animals.2 = c("cat; dog; bird","dog; bird; seal", "bird", ""),year= c("2001","2005","2010","2018"), stringsAsFactors = F)
> df
animals.1 animals.2 year
1 cat; dog; bird cat; dog; bird 2001
2 dog; bird dog; bird; seal 2005
3 bird bird 2010
4 dog 2018
所需的输出
> df
animals.1 animals.2 year count
1 cat; dog; bird cat; dog; bird 2001 3
2 dog; bird dog; bird; seal 2005 4
3 bird bird 2010 1
4 dog 2018 0
修改
在第2行animal.1
= dog; bird
中,出现在前5年的animal.2
= dog; bird
(在2005年)和dog; bird
(在2001年)中。总计数= 4
在第3行animals.1
= bird
中,前五年出现在animal.2
= bird
列中(2010年),而2005年不在我的五年范围内。总计数= 1
在上一个post中,我仅在没有年份条件的情况下问了类似的问题。 但是,年份条件不能添加到提供的解决方案中。
任何帮助将不胜感激:)
答案 0 :(得分:1)
使用base
的{{1}}方式:
mapply()
我假设within(df,
count <- mapply(function(x, y) {
in5year <- paste(animals.2[year %in% (x-4):x], collapse = "; ")
sum(strsplit(in5year, "; ")[[1]] %in% strsplit(y, "; ")[[1]])
}, year, animals.1)
)
# animals.1 animals.2 year count
# 1 cat; dog; bird cat; dog; bird 2001 3
# 2 dog; bird dog; bird; seal 2005 4
# 3 bird bird 2010 1
# 4 dog 2018 0
列是数字。如果没有,请先将其转换为数字。
答案 1 :(得分:1)
您的代码尚未设为机器可读。机器在读取“长”数据以及执行分组和联接操作方面要好得多。
当您寻找x %in% y
时,您将进行很多比较。然后执行字符串操作也会减慢您的速度(拆分字符串必须找到拆分字符串的位置)。我建议您将所有数据转换为长格式,然后将其保留为长格式,直到您需要宽格式以供人类查看为止。但是我以您的格式给您输出,因为问题要求这样做。
下面的大多数代码将您的数据转换为长数据格式。我在代码中采取了额外的步骤,以尝试分解数据进入计算的方式。
library(dplyr)
library(tidyr)
library(stringr)
df = data.frame(animals.1 = c("cat; dog; bird", "dog; bird", "bird", "dog"), animals.2 = c("cat; dog; bird","dog; bird; seal", "bird", ""),year= c("2001","2005","2010","2018"), stringsAsFactors = F)
# Convert the animal.1 column to long data
animals_1_long <- df %>%
rowwise() %>%
mutate(
animals_1 = str_split(animals.1,"; ")
) %>%
select(animals_1,year) %>%
unnest()
# # A tibble: 7 x 2
# year animals_1
# <chr> <chr>
# 1 2001 cat
# 2 2001 dog
# 3 2001 bird
# 4 2005 dog
# 5 2005 bird
# 6 2010 bird
# 7 2018 dog
# Similarly convert the animal.2 column to long data
animals_2_long <- df %>%
rowwise() %>%
mutate(
animals_2 = str_split(animals.2,"; ")
) %>%
select(animals_2,year) %>%
unnest()
# Since we want to match for the last 5 years, create a match index for year-4 to year.
animals_2_long_extend_5yrs <- animals_2_long %>%
rename(index_year = year) %>%
rowwise() %>%
mutate(match_year = list(as.character((as.numeric(index_year)-4):as.numeric(index_year)))) %>%
unnest()
# # A tibble: 40 x 3
# index_year animals_2 match_year
# <chr> <chr> <chr>
# 1 2001 cat 1997
# 2 2001 cat 1998
# 3 2001 cat 1999
# 4 2001 cat 2000
# 5 2001 cat 2001
# 6 2001 dog 1997
# 7 2001 dog 1998
# 8 2001 dog 1999
# 9 2001 dog 2000
# 10 2001 dog 2001
在这一点上,animal_1数据的格式很长,每行一年。 animal_2数据采用长格式,每行一个动物/ match_year / index_year。这样一来,第二个数据集就可以在一次连接中覆盖过去5年的全部时间,然后将其汇总为我们最初感兴趣的年份。
将两个长数据集结合在一起,只留下年份与match_year匹配且动物名称与之匹配的行。然后,对index_year中剩余的行数求和是不重要的。
# Join the long data and the long data with the extended match index
animal_check <- animals_1_long %>%
rename(match_year = year) %>%
left_join(animals_2_long_extend_5yrs) %>%
filter(animals_1 == animals_2) %>%
# group by the index year and summarize the count
group_by(index_year) %>%
summarise(count = n()) %>%
rename(year = index_year)
# # A tibble: 3 x 2
# year count
# <chr> <int>
# 1 2001 3
# 2 2005 4
# 3 2010 1
至此,计算完成。剩下的就是将动物的计数加回到数据上。
# Join the yearly result back to the original dataframe
df <- df %>%
left_join(animal_check)
df
# animals.1 animals.2 year count
# 1 cat; dog; bird cat; dog; bird 2001 3
# 2 dog; bird dog; bird; seal 2005 4
# 3 bird bird 2010 1
# 4 dog 2018 NA
更新:
# Data for benchmark:
df = data.frame(animals.1 = c("cat; dog; bird", "dog; bird", "bird", "dog"),
animals.2 = c("cat; dog; bird","dog; bird; seal", "bird", ""),
stringsAsFactors = F)
df <- replicate(10000,{df}, simplify=F) %>% do.call(rbind, .)
df$year <- as.character(seq(2000,2000 + nrow(df) - 1))
# microbenchmark results
min lq mean median uq max neval
5.785196 5.950748 6.642028 6.981055 7.001854 7.491287 5