How to search for and extract matching words from different columns of a data frame?

Time: 2019-01-31 15:18:56

Tags: r

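I have a data frame with a column named "Destination". This field contains destinations/places (a continent, a country, several countries, a city, several cities, or a mix of these). I have another data frame with three columns: continent_name, country_name and city_name. I want to get new columns holding the continent, country and city names by matching the Destination field against the columns of the second data frame.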

Data frame A:

+---------+------------------------------------+
|  Name   |            Destination             |
+---------+------------------------------------+
| Alex    | North America, Europe & France     |
| Mike    | Boston, London, Germany, Australia |
| Charlie | China, Europe, India, New York     |
| Lophy   | Antartica, UK, Europe, Delhi       |
+---------+------------------------------------+

Data frame B:

+---------------+-----------+----------+
|   Continent   |  Country  |   City   |
+---------------+-----------+----------+
| north america | france    | boston   |
| anatartica    | germany   | london   |
| europe        | australia | delhi    |
| XYZ           | china     | new york |
| ABC           | india     | RST      |
| PQR           | UK        | JKL      |
+---------------+-----------+----------+

Expected output:

+---------+-----------------------+--------------------+----------------+
|  Name   |       Continent       |      Country       |      City      |
+---------+-----------------------+--------------------+----------------+
| Alex    | North America, Europe | France             |                |
| Mike    | NA                    | Germany, Australia | Boston, London |
| Charlie | Europe                | China, India       | New York       |
| Lophy   | Antartica, Europe     | UK                 | Delhi          |
+---------+-----------------------+--------------------+----------------+

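All continent names should be matched first and, when there are several matches, stored as comma-separated values; then the country names, then the city names.

I have looked through several related questions, but none of them covers this specific case.

3 answers:

Answer 0: (score: 2)

The simplest approach is to put both tables into long format, join them, and then go back to wide format using the destination type: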

library(tidyverse)
B2 <- B %>% 
  gather(type,lower_dest) %>%
  mutate_at("lower_dest", tolower)

A2 <- A %>% 
  separate_rows(Destination,sep="\\s*[,&]\\s*") %>%
  mutate(lower_dest = tolower(Destination))

left_join(A2, B2, by = "lower_dest") %>%
  group_by(Name, type) %>%
  summarize_at("Destination", paste,collapse=", ") %>%
  spread(type, Destination) %>%
  ungroup

# # A tibble: 4 x 4
#      Name           City             Continent            Country
# *   <chr>          <chr>                 <chr>              <chr>
# 1    Alex           <NA> North America, Europe             France
# 2 Charlie       New York                Europe       China, India
# 3   Lophy          Delhi     Antartica, Europe                 UK
# 4    Mike Boston, London                  <NA> Germany, Australia

Data

A <-
  tribble(~Name   , ~Destination ,   
 'Alex'    , 'North America, Europe & France',     
 'Mike'    , 'Boston, London, Germany, Australia', 
 'Charlie' , 'China, Europe, India, New York', 
 'Lophy'   , 'Antartica, UK, Europe, Delhi')     


# anatartica typo corrected into antartica  
B <- tribble(~Continent, ~Country, ~City,
 'north america' , 'france'    , 'boston'   ,
 'antartica'    , 'germany'   , 'london'   ,
 'europe'        , 'australia' , 'delhi'    ,
 'XYZ'           , 'china'     , 'new york' ,
 'ABC'           , 'india'     , 'RST'      ,
 'PQR'           , 'UK'        , 'JKL')
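
gather() and spread() still work, but tidyr 1.0.0 and later supersede them with pivot_longer() and pivot_wider(). The following is a minimal sketch of the same long-join-wide idea with the newer functions, not the answerer's own code. The names B_long, A_long, matched and result are illustrative, and it swaps left_join() for inner_join() plus a final join back to the names, so that unmatched destinations do not create an extra NA column.

```r
library(dplyr)
library(tidyr)

A <- data.frame(
  Name = c("Alex", "Mike", "Charlie", "Lophy"),
  Destination = c("North America, Europe & France",
                  "Boston, London, Germany, Australia",
                  "China, Europe, India, New York",
                  "Antartica, UK, Europe, Delhi"),
  stringsAsFactors = FALSE
)

# Same lookup table as above, with the 'antartica' spelling corrected
B <- data.frame(
  Continent = c("north america", "antartica", "europe", "XYZ", "ABC", "PQR"),
  Country   = c("france", "germany", "australia", "china", "india", "UK"),
  City      = c("boston", "london", "delhi", "new york", "RST", "JKL"),
  stringsAsFactors = FALSE
)

# Lookup table in long format: one row per (type, lowercased place name)
B_long <- B %>%
  pivot_longer(everything(), names_to = "type", values_to = "lower_dest") %>%
  mutate(lower_dest = tolower(lower_dest))

# One row per individual destination, split on "," or "&"
A_long <- A %>%
  separate_rows(Destination, sep = "\\s*[,&]\\s*") %>%
  mutate(lower_dest = tolower(Destination))

# Keep only destinations found in the lookup, then collapse per Name and type
matched <- A_long %>%
  inner_join(B_long, by = "lower_dest") %>%
  group_by(Name, type) %>%
  summarise(Destination = paste(Destination, collapse = ", "), .groups = "drop") %>%
  pivot_wider(names_from = type, values_from = Destination)

# Join back so every Name is kept, in the original row order
result <- A %>%
  select(Name) %>%
  left_join(matched, by = "Name")

result
```

Types with no match for a given name come out as NA, as in the output shown above.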

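Answer 1: (score: 0)

Some functions that may help you:

tolower() converts all your words to lowercase, so they match even when capitalization is mixed. str_split() from stringr lets you split the destinations into elements on the comma separator.

So, first, you need a vector containing all the destinations:

destination_vector <- unique(trimws(unlist(strsplit(tolower(Destination), ",")))) will do. Since strsplit gives you a list, you need unlist to get a vector; trimws removes the space left after each comma, and unique drops the duplicates.

Then you need to check whether your destinations are among the continents, countries or cities:

Continent[Continent %in% destination_vector] will do. The same goes for Country and City.

Then you can use paste with collapse = ", " to join the matches with a comma as the separator.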
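
The steps above can be put together roughly as follows. This is a base R sketch under a few assumptions: the data frames are called A and B as in the question, the helper match_column is a hypothetical name, the matching is done per row rather than on one pooled vector, and the split pattern also treats "&" as a separator.

```r
A <- data.frame(
  Name = c("Alex", "Mike", "Charlie", "Lophy"),
  Destination = c("North America, Europe & France",
                  "Boston, London, Germany, Australia",
                  "China, Europe, India, New York",
                  "Antartica, UK, Europe, Delhi"),
  stringsAsFactors = FALSE
)

# Lookup table with the 'antartica' spelling aligned to data frame A
B <- data.frame(
  Continent = c("north america", "antartica", "europe", "XYZ", "ABC", "PQR"),
  Country   = c("france", "germany", "australia", "china", "india", "UK"),
  City      = c("boston", "london", "delhi", "new york", "RST", "JKL"),
  stringsAsFactors = FALSE
)

# Split every Destination on "," or "&", dropping the surrounding whitespace
parts <- strsplit(A$Destination, "\\s*[,&]\\s*")

# Hypothetical helper: for one lookup column, return the matching
# destinations of each row as a comma-separated string (NA if none match)
match_column <- function(lookup) {
  vapply(parts, function(p) {
    hits <- p[tolower(p) %in% tolower(lookup)]
    if (length(hits) == 0) NA_character_ else paste(hits, collapse = ", ")
  }, character(1))
}

# Apply the helper to each column of B and bind the result to the names
result <- cbind(A["Name"],
                as.data.frame(lapply(B, match_column), stringsAsFactors = FALSE))
result
```

Lowercasing both sides inside the helper keeps the original capitalization in the output while still matching "UK" against "uk".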

Best!

Answer 2: (score: 0)

# data
d <- read.table(text = "Name Destination
Alex 'North America, Europe & France'
Mike 'Boston, London, Germany, Australia'
Charlie 'China, Europe, India, New York'
Lophy 'Antartica, UK, Europe, Delhi'",
                header = TRUE,
                stringsAsFactors = FALSE)
d$Destination <- gsub("&", ",", d$Destination)
d$Destination <- tolower(d$Destination)
d$Destination <- trimws(d$Destination)
d

d2 <- read.table(text = " Continent  Country City
'north america' france boston
anatartica  germany london
europe australia delhi
XYZ china 'new york' 
ABC india RST
PQR UK  JKK", header = TRUE, stringsAsFactors = FALSE)
d2

# splits ..
check_fun <- function(a, b) {
  toString(intersect(trimws(strsplit(d$Destination[a], ",")[[1]], "both"), d2[[b]]))
}

want <- as.data.frame(do.call(cbind,
                              lapply(colnames(d2),
                                     function(x) {
                                       sapply(seq_along(d$Destination),
                                              function(y) {
                                                check_fun(y, x)
                                              }
                                              )
                                       })), stringsAsFactors = FALSE)
colnames(want) <- colnames(d2)
want$Name <- d$Name
want                              

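# Continent            Country           City    Name
# 1 north america, europe             france                   Alex
# 2                       germany, australia boston, london    Mike
# 3                europe       china, india       new york Charlie
# 4                europe                             delhi   Lophy
# Note: row 4 lacks antartica and uk because d2 spells the continent
# 'anatartica' and stores 'UK' in upper case, while Destination is lowercased.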
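
The missing "uk" in the last row comes from comparing lowercased destinations with a lookup column that still contains upper case. A small sketch of the effect, using illustrative vectors (destination, country, tokens) rather than the data frames above:

```r
destination <- "Antartica, UK, Europe, Delhi"
country <- c("france", "germany", "australia", "china", "india", "UK")

# Lowercase and split the destination, then trim the space after each comma
tokens <- trimws(strsplit(tolower(destination), ",")[[1]])

# Comparing against the raw lookup misses "UK" because intersect() is case sensitive
raw_match <- toString(intersect(tokens, country))

# Lowercasing the lookup as well recovers the match
normalized_match <- toString(intersect(tokens, tolower(country)))

raw_match         # ""
normalized_match  # "uk"
```

In the answer's code the same effect could be obtained by lowercasing d2 once, for example with d2[] <- lapply(d2, tolower), before calling check_fun.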