Grepl matching between two columns

Date: 2020-11-06 05:54:29

Tags: r string string-matching grepl

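I have two datasets: one has an address column, and the other contains locality names with their corresponding latitude and longitude.

The stores dataset: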

+--------------------+-----------+--------------------------------------------------+
|     Store name     | Postcodes |                     Address                      |
+--------------------+-----------+--------------------------------------------------+
| Floral showers     |      2000 | Street 45, Level 9, Sydney, New South Wales 2000 |
| Cookie box         |      4300 | Shop 3, Queensland 4300                          |
| Mango troopers     |      2010 | Aberdeen, Bankstown, NSW                         |
| Building AE44      |      4300 | 778/9 Goulburn Street, QLD                       |
| Floral showers Co. |      2230 | Steert 47 Cronulla, New South Wales 2230         |
| Vinci supplies     |      2560 | West AIRDS, Mayfaille NSW                        |
+--------------------+-----------+--------------------------------------------------+

The dataset with latitude and longitude information:


+-------------------+-------+-------------+--------------+
|     Locality      | State |  Latitude   |  Longitude   |
+-------------------+-------+-------------+--------------+
| ABERDARE          | NSW   |  151.317476 |   -32.977861 |
| ABERDEEN          | NSW   |  151.102917 |    -32.14622 |
| ACACIA PLATEAU    | NSW   |   152.49765 |    -28.36456 |
| AIRDS             | NSW   |  150.768408 |   -34.194216 |
| ADAMINABY         | NSW   |  148.769744 |   -35.997349 |
| ABERCROMBIE RIVER | NSW   | 149.3476918 | -33.91030648 |
| CRONULLA          | NSW   |  151.136596 |   -34.093213 |
| SYDNEY            | NSW   |  151.268071 |   -33.794883 |
+-------------------+-------+-------------+--------------+

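I want to create a new column that extracts each store's locality from the Address column, and then fill in the latitude and longitude from the other dataset. Because the addresses are not in a fixed format, I know I have to do a string search, but I am not sure how to compare across two columns.

Here is the dput output for both samples: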

df <- structure(list(Stores_names = c("Floral showers", "Cookie box", 
"Mango troopers", "Building AE44", "Floral showers Co.", "Vinci supplies"
), Postcodes = c("2000", "4300", "2010", "4300", "2230", "2560"
), Address = c("Street 45, Level 9, Sydney, New South Wales 2000", 
"Shop 3, Queensland 4300", "Aberdeen, Bankstown, NSW", "778/9 Goulburn Street, QLD", 
"Steert 47 Cronulla, New South Wales 2230", "West AIRDS, Mayfaille NSW"
)), class = "data.frame", row.names = c(NA, -6L))

df1 <- structure(list(Localities = c("ABERDARE", "ABERDEEN", "ACACIA PLATEAU", 
"AIRDS", "ADAMINABY", "ABERCROMBIE RIVER", "CRONULLA", "SYDNEY"
), State = c("NSW", "NSW", "NSW", "NSW", "NSW", "NSW", "NSW", 
"NSW"), lat = c("151.317476", "151.102917", "152.49765", "150.768408", 
"148.769744", "149.3476918", "151.136596", "151.268071"), long = c("-32.977861", 
"-32.14622", "-28.36456", "-34.194216", "-35.997349", "-33.91030648", 
"-34.093213", "-33.794883")), class = "data.frame", row.names = c(NA, 
-8L))

My final dataset should contain three new columns: locality, latitude, and longitude.


+--------------------+-----------+--------------------------------------------------+----------+------------+------------+
|     Store name     | Postcodes |                     Address                      | Locality |    lat     |    long    |
+--------------------+-----------+--------------------------------------------------+----------+------------+------------+
| Floral showers     |      2000 | Street 45, Level 9, Sydney, New South Wales 2000 | Sydney   | 151.268071 | -33.794883 |
| Cookie box         |      4300 | Shop 3, Queensland 4300                          |          |            |            |
| Mango troopers     |      2010 | Aberdeen, Bankstown, NSW                         | Aberdeen | 151.102917 |  -32.14622 |
| Building AE44      |      4300 | 778/9 Goulburn Street, QLD                       |          |            |            |
| Floral showers Co. |      2230 | Steert 47 Cronulla, New South Wales 2230         | Cronulla | 151.136596 | -34.093213 |
| Vinci supplies     |      2560 | West AIRDS, Mayfaille NSW                        | AIRDS    | 150.768408 | -34.194216 |
+--------------------+-----------+--------------------------------------------------+----------+------------+------------+

Stores whose locality is not found in the lat/long dataset can be left blank, but I need every row from the store dataset.

Thanks for your help!

1 answer:

Answer 0 (score: 0)

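This works:

library(stringr)
library(dplyr)

# Join all locality names into one regex alternation, extract the first
# match from the upper-cased address, then left-join so every store is kept.
df %>%
  mutate(city = str_extract(toupper(Address), paste0(df1$Localities, collapse = "|"))) %>%
  left_join(df1, by = c("city" = "Localities"), keep = TRUE) %>%
  select(-c(city, State))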
        Stores_names Postcodes                                          Address Localities        lat       long
1     Floral showers      2000 Street 45, Level 9, Sydney, New South Wales 2000     SYDNEY 151.268071 -33.794883
2         Cookie box      4300                          Shop 3, Queensland 4300       <NA>       <NA>       <NA>
3     Mango troopers      2010                         Aberdeen, Bankstown, NSW   ABERDEEN 151.102917  -32.14622
4      Building AE44      4300                       778/9 Goulburn Street, QLD       <NA>       <NA>       <NA>
5 Floral showers Co.      2230         Steert 47 Cronulla, New South Wales 2230   CRONULLA 151.136596 -34.093213
6     Vinci supplies      2560                        West AIRDS, Mayfaille NSW      AIRDS 150.768408 -34.194216
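Since the question asks about grepl specifically, here is a minimal base R sketch of the same idea, offered as one possible variant rather than the answer's method. The names `stores`, `localities`, and `match_locality` are hypothetical and introduced only for this example; the data is copied from the question's dput output. It assumes the locality names contain no regex metacharacters, and it wraps each name in `\b` word boundaries so that a short name cannot match inside a longer word.

```r
# Sample data copied from the question's dput output.
stores <- data.frame(
  Stores_names = c("Floral showers", "Cookie box", "Mango troopers",
                   "Building AE44", "Floral showers Co.", "Vinci supplies"),
  Postcodes = c("2000", "4300", "2010", "4300", "2230", "2560"),
  Address = c("Street 45, Level 9, Sydney, New South Wales 2000",
              "Shop 3, Queensland 4300", "Aberdeen, Bankstown, NSW",
              "778/9 Goulburn Street, QLD",
              "Steert 47 Cronulla, New South Wales 2230",
              "West AIRDS, Mayfaille NSW"),
  stringsAsFactors = FALSE
)
localities <- data.frame(
  Localities = c("ABERDARE", "ABERDEEN", "ACACIA PLATEAU", "AIRDS",
                 "ADAMINABY", "ABERCROMBIE RIVER", "CRONULLA", "SYDNEY"),
  State = "NSW",
  lat = c("151.317476", "151.102917", "152.49765", "150.768408",
          "148.769744", "149.3476918", "151.136596", "151.268071"),
  long = c("-32.977861", "-32.14622", "-28.36456", "-34.194216",
           "-35.997349", "-33.91030648", "-34.093213", "-33.794883"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: index of the first locality that appears in the
# address as a whole word (case-insensitive), or NA if none does.
match_locality <- function(address, locality_names) {
  hits <- vapply(
    locality_names,
    function(nm) grepl(paste0("\\b", nm, "\\b"), address,
                       ignore.case = TRUE, perl = TRUE),
    logical(1),
    USE.NAMES = FALSE
  )
  if (any(hits)) which(hits)[1] else NA_integer_
}

# One locality index per store; NA where nothing matched.
idx <- vapply(stores$Address, match_locality, integer(1),
              locality_names = localities$Localities, USE.NAMES = FALSE)

# Indexing with NA yields NA, so unmatched stores keep their row
# with empty locality and coordinates.
result <- stores
result$Locality <- localities$Localities[idx]
result$lat <- as.numeric(localities$lat[idx])
result$long <- as.numeric(localities$long[idx])
print(result)
```

Because the new columns are filled by index rather than by a join, every store row is kept. The coordinates are also converted from character to numeric here, which the question's sample data would otherwise leave as strings. When an address mentions more than one locality, this sketch takes whichever comes first in `localities`, which may or may not be the desired rule.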