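I have a data frame with a field named "Destination". This field contains destinations/places (it can be a country, a continent, several countries, a city, several cities, or a mix of these). I have another data frame with 3 columns: continent_name, country_name and city_name. I want to get new columns holding the continent, country and city names by matching the Destination field against the columns of the second data frame.
Data frame A: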
+---------+------------------------------------+
| Name | Destination |
+---------+------------------------------------+
| Alex | North America, Europe & France |
| Mike | Boston, London, Germany, Australia |
| Charlie | China, Europe, India, New York |
| Lophy | Antartica, UK, Europe, Delhi |
+---------+------------------------------------+
Data frame B:
+---------------+-----------+----------+
| Continent | Country | City |
+---------------+-----------+----------+
| north america | france | boston |
| anatartica | germany | london |
| europe | australia | delhi |
| XYZ | china | new york |
| ABC | india | RST |
| PQR | UK | JKL |
+---------------+-----------+----------+
Expected output:
+---------+-----------------------+--------------------+----------------+
| Name | Continent | Country | City |
+---------+-----------------------+--------------------+----------------+
| Alex | North America, Europe | France | |
| Mike | NA | Germany, Australia | Boston, London |
| Charlie | Europe | China, India | New York |
| Lophy | Antartica, Europe | UK | Delhi |
+---------+-----------------------+--------------------+----------------+
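Continent names should be matched first and stored as comma-separated values when there is more than one match, then country names, then city names.
I found several related questions, but none of them covered this specific case.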
Answer 0 (score: 2)
The simplest approach is to put both tables into long format, join them, and then go back to wide format using the destination type:
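library(tidyverse)
# Lookup table in long format: one row per (type, place), lowercased for matching
B2 <- B %>%
  gather(type, lower_dest) %>%
  mutate_at("lower_dest", tolower)
# One row per destination, split on "," or "&"
A2 <- A %>%
  separate_rows(Destination, sep = "\\s*[,&]\\s*") %>%
  mutate(lower_dest = tolower(Destination))
# Join, collapse the matches per Name and type, then go back to wide format
left_join(A2, B2, by = "lower_dest") %>%
  group_by(Name, type) %>%
  summarize_at("Destination", paste, collapse = ", ") %>%
  spread(type, Destination) %>%
  ungroup()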
# # A tibble: 4 x 4
# Name City Continent Country
# * <chr> <chr> <chr> <chr>
# 1 Alex <NA> North America, Europe France
# 2 Charlie New York Europe China, India
# 3 Lophy Delhi Antartica, Europe UK
# 4 Mike Boston, London <NA> Germany, Australia
Data
A <-
tribble(~Name , ~Destination ,
'Alex' , 'North America, Europe & France',
'Mike' , 'Boston, London, Germany, Australia',
'Charlie' , 'China, Europe, India, New York',
'Lophy' , 'Antartica, UK, Europe, Delhi')
# anatartica typo corrected into antartica
B <- tribble(~Continent, ~Country, ~City,
'north america' , 'france' , 'boston' ,
'antartica' , 'germany' , 'london' ,
'europe' , 'australia' , 'delhi' ,
'XYZ' , 'china' , 'new york' ,
'ABC' , 'india' , 'RST' ,
'PQR' , 'UK' , 'JKL')
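On tidyr 1.0.0 or later, gather() and spread() are superseded by pivot_longer() and pivot_wider(), so the same long-join-wide idea might be sketched as below. This is only an illustration, not part of the answer above: the names A_long, B_long and result are made up for the example, the data frames are rebuilt with data.frame() so the block runs on its own, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# Same data as above, with the "anatartica" typo corrected
A <- data.frame(
  Name = c("Alex", "Mike", "Charlie", "Lophy"),
  Destination = c("North America, Europe & France",
                  "Boston, London, Germany, Australia",
                  "China, Europe, India, New York",
                  "Antartica, UK, Europe, Delhi"),
  stringsAsFactors = FALSE
)
B <- data.frame(
  Continent = c("north america", "antartica", "europe", "XYZ", "ABC", "PQR"),
  Country   = c("france", "germany", "australia", "china", "india", "UK"),
  City      = c("boston", "london", "delhi", "new york", "RST", "JKL"),
  stringsAsFactors = FALSE
)

# Lookup table in long format: one row per (type, place), lowercased
B_long <- B %>%
  pivot_longer(everything(), names_to = "type", values_to = "lower_dest") %>%
  mutate(lower_dest = tolower(lower_dest))

# One row per destination, split on "," or "&"
A_long <- A %>%
  separate_rows(Destination, sep = "\\s*[,&]\\s*") %>%
  mutate(lower_dest = tolower(Destination))

# Join, collapse matches per Name and type, then spread the types into columns
result <- A_long %>%
  left_join(B_long, by = "lower_dest") %>%
  group_by(Name, type) %>%
  summarise(Destination = paste(Destination, collapse = ", "), .groups = "drop") %>%
  pivot_wider(names_from = type, values_from = Destination)

result
```

A Name with no match for a given type gets NA in that column, as in the output shown above. A destination that matches nothing in B would end up with type NA and produce an extra column, so it may be worth filtering those rows out before pivot_wider().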
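Answer 1 (score: 0)
Some functions that can help you:
tolower() converts all your words to lowercase, so matching still works when the capitalization is mixed.
str_split() from the stringr package splits each Destination string into its comma-separated elements.
So first, you need a vector containing all the destinations: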
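destination_vector <- unique(trimws(unlist(strsplit(tolower(A$Destination), "[,&]"))))
This should work. Because strsplit() returns a list, you need unlist() to get a vector, and unique() removes the duplicates. Splitting on "[,&]" also covers the "&" separator, and trimws() strips the spaces left around each element.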
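Then you need to check which continents, countries or cities appear among your destinations:
B$Continent[tolower(B$Continent) %in% destination_vector]
This should work. The same goes for countries and cities.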
Then you can use paste() with collapse = ", " to join the matches into a single comma-separated string.
Best!
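Put together, the steps above might look like the following base R sketch. It is only an illustration under a few assumptions: the helper name match_places is hypothetical, the matching is done per row of A rather than on one global vector so that each Name gets its own result, and the "anatartica" typo in B is corrected so that "Antartica" can match.

```r
A <- data.frame(
  Name = c("Alex", "Mike", "Charlie", "Lophy"),
  Destination = c("North America, Europe & France",
                  "Boston, London, Germany, Australia",
                  "China, Europe, India, New York",
                  "Antartica, UK, Europe, Delhi"),
  stringsAsFactors = FALSE
)
B <- data.frame(
  Continent = c("north america", "antartica", "europe", "XYZ", "ABC", "PQR"),
  Country   = c("france", "germany", "australia", "china", "india", "UK"),
  City      = c("boston", "london", "delhi", "new york", "RST", "JKL"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the parts of one Destination string that occur
# in `lookup` (case-insensitive) and join them with ", "
match_places <- function(destination, lookup) {
  parts <- trimws(unlist(strsplit(destination, "[,&]")))
  hits <- parts[tolower(parts) %in% tolower(lookup)]
  paste(hits, collapse = ", ")
}

# Apply the helper to every row of A, once per column of B
result <- data.frame(
  Name      = A$Name,
  Continent = vapply(A$Destination, match_places, character(1),
                     lookup = B$Continent, USE.NAMES = FALSE),
  Country   = vapply(A$Destination, match_places, character(1),
                     lookup = B$Country, USE.NAMES = FALSE),
  City      = vapply(A$Destination, match_places, character(1),
                     lookup = B$City, USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

result
```

Because the matches are taken from the Destination strings themselves, the original capitalization is kept. A row with no match for a column gets an empty string rather than NA.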
Answer 2 (score: 0)
# data
d <- read.table(text = "Name Destination
Alex 'North America, Europe & France'
Mike 'Boston, London, Germany, Australia'
Charlie 'China, Europe, India, New York'
Lophy 'Antartica, UK, Europe, Delhi'",
header = TRUE,
stringsAsFactors = FALSE)
d$Destination <- gsub("&", ",", d$Destination)
d$Destination <- tolower(d$Destination)
d$Destination <- trimws(d$Destination)
d
d2 <- read.table(text = " Continent Country City
'north america' france boston
antartica germany london
europe australia delhi
XYZ china 'new york'
ABC india RST
PQR UK JKL", header = TRUE, stringsAsFactors = FALSE)
# lowercase the lookup values so they match the lowercased destinations
d2[] <- lapply(d2, tolower)
d2
# split each destination string and keep the parts found in column b of d2
check_fun <- function(a, b) {
  toString(intersect(trimws(strsplit(d$Destination[a], ",")[[1]], "both"), d2[[b]]))
}
want <- as.data.frame(do.call(cbind,
  lapply(colnames(d2),
    function(x) {
      sapply(seq_along(d$Destination),
        function(y) {
          check_fun(y, x)
        }
      )
    })), stringsAsFactors = FALSE)
colnames(want) <- colnames(d2)
want$Name <- d$Name
want
#               Continent            Country           City    Name
# 1 north america, europe             france                   Alex
# 2                       germany, australia boston, london    Mike
# 3                europe       china, india       new york Charlie
# 4     antartica, europe                 uk          delhi   Lophy