Question

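I have a variable in a data frame whose field name is "Destination". This field contains destinations/places (each entry can be a continent, a country, a city, several of these, or a mix). I have another data frame with three columns: continent_name, country_name, and city_name. I want to match the Destination field against the columns of the second data frame to get new columns holding the continent, country, and city names.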

Data frame A:

+---------+------------------------------------+
|  Name   |            Destination             |
+---------+------------------------------------+
| Alex    | North America, Europe & France     |
| Mike    | Boston, London, Germany, Australia |
| Charlie | China, Europe, India, New York     |
| Lophy   | Antartica, UK, Europe, Delhi       |
+---------+------------------------------------+

Data frame B:

+---------------+-----------+----------+
|   Continent   |  Country  |   City   |
+---------------+-----------+----------+
| north america | france    | boston   |
| anatartica    | germany   | london   |
| europe        | australia | delhi    |
| XYZ           | china     | new york |
| ABC           | india     | RST      |
| PQR           | UK        | JKL      |
+---------------+-----------+----------+

Expected output:

+---------+-----------------------+--------------------+----------------+
|  Name   |       Continent       |      Country       |      City      |
+---------+-----------------------+--------------------+----------------+
| Alex    | North America, Europe | France             |                |
| Mike    | NA                    | Germany, Australia | Boston, London |
| Charlie | Europe                | China, India       | New York       |
| Lophy   | Antartica, Europe     | UK                 | Delhi          |
+---------+-----------------------+--------------------+----------------+

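All continent names should be matched first and, when there are several matches, stored as comma-separated values; then the country names; then the city names.

I came across several related questions, but none of them addressed this specific case.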

Answer 1

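The simplest approach is to put both tables into long format, join them, and then spread the result back to wide format using the destination type: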

library(tidyverse)
B2 <- B %>% 
  gather(type,lower_dest) %>%
  mutate_at("lower_dest", tolower)

A2 <- A %>% 
  separate_rows(Destination,sep="\\s*[,&]\\s*") %>%
  mutate(lower_dest = tolower(Destination))

left_join(A2, B2, by = "lower_dest") %>%
  group_by(Name, type) %>%
  summarize_at("Destination", paste,collapse=", ") %>%
  spread(type, Destination) %>%
  ungroup

# # A tibble: 4 x 4
#      Name           City             Continent            Country
# *   <chr>          <chr>                 <chr>              <chr>
# 1    Alex           <NA> North America, Europe             France
# 2 Charlie       New York                Europe       China, India
# 3   Lophy          Delhi     Antartica, Europe                 UK
# 4    Mike Boston, London                  <NA> Germany, Australia

Data

A <-
  tribble(~Name   , ~Destination ,   
 'Alex'    , 'North America, Europe & France',     
 'Mike'    , 'Boston, London, Germany, Australia', 
 'Charlie' , 'China, Europe, India, New York', 
 'Lophy'   , 'Antartica, UK, Europe, Delhi')     


# anatartica typo corrected into antartica  
B <- tribble(~Continent, ~Country, ~City,
 'north america' , 'france'    , 'boston'   ,
 'antartica'    , 'germany'   , 'london'   ,
 'europe'        , 'australia' , 'delhi'    ,
 'XYZ'           , 'china'     , 'new york' ,
 'ABC'           , 'india'     , 'RST'      ,
 'PQR'           , 'UK'        , 'JKL')
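
For readers on newer versions of the tidyverse: gather() and spread() are superseded in tidyr 1.0 and later, and summarize_at() in dplyr 1.0 and later. Below is a minimal sketch of the same long, join, wide idea with pivot_longer() and pivot_wider(), assuming those versions are installed. The names A_long, B_long and result are chosen here for illustration, and inner_join() is swapped in for left_join() so that unmatched destinations are dropped instead of being collected under an NA type.

```r
library(dplyr)
library(tidyr)

# Same data as above, with the "anatartica" typo corrected
A <- data.frame(
  Name = c("Alex", "Mike", "Charlie", "Lophy"),
  Destination = c("North America, Europe & France",
                  "Boston, London, Germany, Australia",
                  "China, Europe, India, New York",
                  "Antartica, UK, Europe, Delhi"),
  stringsAsFactors = FALSE
)
B <- data.frame(
  Continent = c("north america", "antartica", "europe", "XYZ", "ABC", "PQR"),
  Country   = c("france", "germany", "australia", "china", "india", "UK"),
  City      = c("boston", "london", "delhi", "new york", "RST", "JKL"),
  stringsAsFactors = FALSE
)

# Lookup table in long format: one row per (type, lower-cased name)
B_long <- B %>%
  pivot_longer(everything(), names_to = "type", values_to = "lower_dest") %>%
  mutate(lower_dest = tolower(lower_dest))

# One row per individual destination, split on "," or "&"
A_long <- A %>%
  separate_rows(Destination, sep = "\\s*[,&]\\s*") %>%
  mutate(lower_dest = tolower(Destination))

# Join, collapse the matches per Name and type, then go back to wide format
result <- A_long %>%
  inner_join(B_long, by = "lower_dest") %>%
  group_by(Name, type) %>%
  summarise(Destination = paste(Destination, collapse = ", "),
            .groups = "drop") %>%
  pivot_wider(names_from = type, values_from = Destination)

result
```

Note that with inner_join() a Name whose destinations match nothing at all would disappear from the result; keep left_join() if such rows need to be retained.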

Answer 2

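Some functions that may help you:

tolower() converts all your words to lowercase, so that they match even when the capitalization is mixed. str_split() from stringr lets you split the destination string into its elements at the commas.

So first, you need to get a vector containing all the destinations: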

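destination_vector <- unique(trimws(unlist(strsplit(tolower(Destination), "[,&]")))) will do. strsplit gives you a list, so you need unlist to get a vector; trimws strips the spaces left around each element, and unique removes the duplicates.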

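Then you need to check whether your destinations are among the continents, countries, or cities:

Continent[Continent %in% destination_vector] will do. The same goes for countries and cities.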

Then you can use paste with collapse = ", " to join the matches into a single comma-separated string.
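
Put together, the steps above might look like the following base R sketch. It applies them to one row at a time, since the question asks for one result per Name. The helper match_column is a hypothetical name introduced here, and the data reuses the question's tables with the "anatartica" typo corrected.

```r
A <- data.frame(
  Name = c("Alex", "Mike", "Charlie", "Lophy"),
  Destination = c("North America, Europe & France",
                  "Boston, London, Germany, Australia",
                  "China, Europe, India, New York",
                  "Antartica, UK, Europe, Delhi"),
  stringsAsFactors = FALSE
)
B <- data.frame(
  Continent = c("north america", "antartica", "europe", "XYZ", "ABC", "PQR"),
  Country   = c("france", "germany", "australia", "china", "india", "UK"),
  City      = c("boston", "london", "delhi", "new york", "RST", "JKL"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for a single destination string, return the entries
# found in `column` (compared case-insensitively) as one comma-separated
# string, or NA when nothing matches.
match_column <- function(destination, column) {
  parts <- trimws(unlist(strsplit(destination, "[,&]")))
  hits  <- parts[tolower(parts) %in% tolower(column)]
  if (length(hits) == 0) NA_character_ else paste(hits, collapse = ", ")
}

# Apply the helper to every row, once per lookup column
result <- data.frame(
  Name      = A$Name,
  Continent = vapply(A$Destination, match_column, character(1),
                     column = B$Continent, USE.NAMES = FALSE),
  Country   = vapply(A$Destination, match_column, character(1),
                     column = B$Country, USE.NAMES = FALSE),
  City      = vapply(A$Destination, match_column, character(1),
                     column = B$City, USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

result
```

Comparing lower-cased copies while pasting the original strings keeps the capitalization from data frame A in the output.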

Best!

Answer 3

# data
d <- read.table(text = "Name Destination
Alex 'North America, Europe & France'
Mike 'Boston, London, Germany, Australia'
Charlie 'China, Europe, India, New York'
Lophy 'Antartica, UK, Europe, Delhi'",
                header = TRUE,
                stringsAsFactors = FALSE)
d$Destination <- gsub("&", ",", d$Destination)
d$Destination <- tolower(d$Destination)
d$Destination <- trimws(d$Destination)
d

d2 <- read.table(text = " Continent  Country City
'north america' france boston
anatartica  germany london
europe australia delhi
XYZ china 'new york' 
ABC india RST
PQR UK  JKK", header = TRUE, stringsAsFactors = FALSE)
d2

# split destination a on commas and keep the entries found in column b of d2
check_fun <- function(a, b) {
  toString(intersect(trimws(strsplit(d$Destination[a], ",")[[1]], "both"), d2[[b]]))
}

want <- as.data.frame(do.call(cbind,
                              lapply(colnames(d2),
                                     function(x) {
                                       sapply(seq_along(d$Destination),
                                              function(y) {
                                                check_fun(y, x)
                                              }
                                              )
                                       })), stringsAsFactors = FALSE)
colnames(want) <- colnames(d2)
want$Name <- d$Name
want                              

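#               Continent            Country           City    Name
# 1 north america, europe             france                   Alex
# 2                       germany, australia boston, london    Mike
# 3                europe       china, india       new york Charlie
# 4                europe                             delhi   Lophy
# Note: row 4 misses "antartica" and "uk" because d2 contains the typo
# "anatartica" and the upper-case "UK", while d$Destination was lower-cased.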

How can I search for and extract matching words from different columns of a data frame?
