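Good day everyone,
I am having trouble with this challenging task and am hoping to find an elegant way to solve it.
The main challenge is that the method should not be hard-coded for any specific elements in the data frames.
First data frame: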
zone = c("M", "N", "O")
country_name = c("The USA, Canada & Mexico are part of North America", "Canada like Australia is a Commonwealth member", "The UK is still finalizing its exit plans from the EU")
df1 <- data.frame(zone, country_name)
Second data frame:
zonal_region = c("M", "M", "M", "N", "N", "N", "O", "O", "O")
country = c("USA", "Canada", "Mexico", "Canada", "Australia", "UK", "Australia", "UK", "Canada")
population = c(323.4 , 36.29, 127.5, 36.29, 24.13, 65.64, 24.13, 65.64, 36.29)
df2 <- data.frame(zonal_region, country, population)
This is the final output I am after:
zone = c("M", "N", "O")
country_name = c("The USA, Canada & Mexico are part of North America", "Canada like Australia is a Commonwealth member", "The UK is still finalizing its exit plans from the EU")
total_population = c(487.19, 60.42, 65.64)
df3 <- data.frame(zone, country_name, total_population)
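I have been unable to extract the multiple substrings and look up their values in df2 according to their zone.
If anyone can solve this, I would be very grateful.
Thank you!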
Answer 0 (score: 1)
We can achieve this by extracting the countries from the 'country_name' column of 'df1', doing a left/right join of the two datasets, and then doing a group_by followed by sum.
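library(tidyverse)
# Unique country names to search for in the sentences
un1 <- unique(df2$country)
df1 %>%
  # Extract every known country mentioned in each sentence (list column)
  mutate(cntry = str_extract_all(country_name, paste(un1, collapse="|"))) %>%
  # Attach the per-zone country populations from df2
  right_join(df2, by = c('zone' = 'zonal_region')) %>%
  group_by(zone) %>%
  # Within each zone, sum only the populations of the mentioned countries
  summarize(total_population= sum(population[country %in% cntry[[1]]])) %>%
  # Bring the original sentences back
  left_join(df1, by = "zone") %>%
  select(zone, country_name, total_population)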
# A tibble: 3 x 3
# zone country_name total_population
#  <fct> <fct> <dbl>
#1 M The USA, Canada & Mexico are part of North America 487.
#2 N Canada like Australia is a Commonwealth member 60.4
#3 O The UK is still finalizing its exit plans from the EU 65.6
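For readers who prefer to avoid the tidyverse, the same extract-then-sum idea can be sketched in base R. This is only an illustration under some assumptions: the data are rebuilt here as character columns rather than factors, and the helper names `pattern` and `mentioned` are made up for this sketch and are not part of the answer above.

```r
# Sample data from the question, stored as character columns (an assumption)
df1 <- data.frame(
  zone = c("M", "N", "O"),
  country_name = c("The USA, Canada & Mexico are part of North America",
                   "Canada like Australia is a Commonwealth member",
                   "The UK is still finalizing its exit plans from the EU"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  zonal_region = c("M", "M", "M", "N", "N", "N", "O", "O", "O"),
  country = c("USA", "Canada", "Mexico", "Canada", "Australia", "UK",
              "Australia", "UK", "Canada"),
  population = c(323.4, 36.29, 127.5, 36.29, 24.13, 65.64, 24.13, 65.64, 36.29),
  stringsAsFactors = FALSE
)

# One alternation pattern built from the known country names
pattern <- paste(unique(df2$country), collapse = "|")

# For each sentence, the countries it mentions (a list of character vectors)
mentioned <- regmatches(df1$country_name, gregexpr(pattern, df1$country_name))

# Per row: sum the populations of the mentioned countries within the same zone
df1$total_population <- mapply(
  function(z, cn) sum(df2$population[df2$zonal_region == z & df2$country %in% cn]),
  df1$zone, mentioned,
  USE.NAMES = FALSE
)

df1
```

Nothing here is tied to particular zones or countries: the pattern is derived from whatever appears in `df2$country`, which is the property the question asks for.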
Answer 1 (score: 1)
You can try fuzzyjoin:
library(dplyr)
library(stringr)
library(fuzzyjoin)
df1 %>%
  # Convert factor columns to character before string matching
  mutate_if(is.factor, as.character) %>%
  # Keep a df2 row when its zone matches and its country appears in the sentence
  fuzzy_left_join((df2 %>% mutate_if(is.factor, as.character)),
                  by = c("zone" = "zonal_region", "country_name" = "country"),
                  match_fun = str_detect) %>%
  group_by(zone, country_name) %>%
  summarise(total_population = sum(population)) %>%
  data.frame()
The output is:
zone country_name total_population
1 M The USA, Canada & Mexico are part of North America 487.19
2 N Canada like Australia is a Commonwealth member 60.42
3 O The UK is still finalizing its exit plans from the EU 65.64
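One caveat that applies to pattern-based matching such as str_detect: the country name is matched as a substring, so a short name can also match inside a longer word. The sample data does not trigger this, but a small base R sketch shows the effect and one possible guard using word boundaries. The sentence and country below are hypothetical and do not come from the question.

```r
# Hypothetical sentence, not part of the question's data
sentence <- "Nigeria is in West Africa"

# Plain substring match: "Niger" is found inside "Nigeria"
substring_hit <- grepl("Niger", sentence)

# Whole-word match: \\b anchors the pattern at word edges, so "Nigeria" no longer matches
whole_word_hit <- grepl("\\bNiger\\b", sentence)

substring_hit
whole_word_hit
```

If this matters for your data, wrapping each country name in `\\b` before building the pattern is one option; whether it is appropriate depends on how the names appear in the text.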
Sample data:
df1 <- structure(list(zone = structure(1:3, .Label = c("M", "N", "O"
), class = "factor"), country_name = structure(c(3L, 1L, 2L), .Label = c("Canada like Australia is a Commonwealth member",
"The UK is still finalizing its exit plans from the EU", "The USA, Canada & Mexico are part of North America"
), class = "factor")), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(zonal_region = structure(c(1L, 1L, 1L, 2L, 2L,
2L, 3L, 3L, 3L), .Label = c("M", "N", "O"), class = "factor"),
country = structure(c(5L, 2L, 3L, 2L, 1L, 4L, 1L, 4L, 2L), .Label = c("Australia",
"Canada", "Mexico", "UK", "USA"), class = "factor"), population = c(323.4,
36.29, 127.5, 36.29, 24.13, 65.64, 24.13, 65.64, 36.29)), class = "data.frame", row.names = c(NA,
-9L))