Question

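I am trying to use the street names in df1 to look up postcodes in df2. df2 contains the same street names as df1, but a street can be linked to multiple postcodes (the same street name can be found in several towns). Where a lookup returns multiple postcodes, I need to take the most common one. Once the postcode is found, I need to add it to a new column in df1, on the same row as the street name. df1 contains 500,000 rows and df2 contains over 900,000 rows.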

head(AllNICrimeData, 10)
 Month Longitude Latitude             Location            Crime.type
1  2015-01 -6.003289 54.55165      SALISBURY PLACE Anti-social behaviour
2  2015-01 -5.707979 54.59231                      Anti-social behaviour
3  2015-01 -5.815976 54.73161        MILEBUSH PARK Anti-social behaviour
4  2015-01 -6.393411 54.19788 COLLEGE SQUARE NORTH Anti-social behaviour
5  2015-01 -6.251798 54.85970         STAFFA DRIVE Anti-social behaviour
6  2015-01 -7.206893 54.62265    KILLYCLOGHER ROAD Anti-social behaviour
7  2015-01 -5.915793 54.59242      RAVENHILL REACH Anti-social behaviour
8  2015-01 -5.535389 54.48792                      Anti-social behaviour
9  2015-01 -7.322812 54.99940   GREAT JAMES STREET Anti-social behaviour
10 2015-01 -5.954670 54.61568         JAMAICA ROAD Anti-social behaviour

head(CleanNIPostcodeData[, 6:14],)
  Number Primary_Thorfare Alt_Thorfare Secondary_Thorfare  Locality          Townland        Town County Postcode
1    134   WHITEPARK ROAD         <NA>               <NA> BALLINTOY BALLINTOY DEMESNE BALLYCASTLE ANTRIM  BT546ND
2     27  PRINCESS STREET         <NA>               <NA>      <NA>         PORT RUSH    PORTRUSH ANTRIM  BT568AX
3   <NA>   COVEHILL COURT         <NA>               <NA>      <NA>        GLENAMANUS    PORTRUSH ANTRIM  BT568GL
4    271     OLDPARK ROAD         <NA>               <NA>      <NA>        TOWN PARKS     BELFAST ANTRIM  BT146QR
5     2A    RAMORE STREET         <NA>               <NA>      <NA>         PORT RUSH    PORTRUSH ANTRIM  BT568BD
6     52  EGLINTON STREET         <NA>               <NA>      <NA>         PORT RUSH    PORTRUSH ANTRIM  BT568DY

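What I need to achieve is to find, for each street in df1, the most frequent postcode associated with it in df2, and add that postcode to a new column on the same row as the street in df1. The example below shows a location that is associated with more than one postcode: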

table(CleanNIPostcodeData$Postcode[AllNICrimeData$Location[3] == CleanNIPostcodeData$Primary_Thorfare])
BT387PU BT387QR 
 22      64

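I have been able to work out how to get the most frequent postcode when several postcodes are associated with a location, but I cannot create a new column holding the postcodes for all streets.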

names(which.max(table(CleanNIPostcodeData$Postcode[AllNICrimeData$Location[3] == CleanNIPostcodeData$Primary_Thorfare])))

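In the example above I find the most common postcode for the 3rd street name in df1. The output is the postcode "BT387QR".

How can I apply the code above to the whole column, and create and populate a new postcode column in df1?

The expected output is a new column in df1 containing the matching postcode for each street name.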

Answer 1

All you need is to join the two data.frames with dplyr::left_join and pick up Postcode.

The result below uses modified data to demonstrate the logic.

library(dplyr)
AllNICrimeData %>% left_join(select(CleanNIPostcodeData, Primary_Thorfare,Postcode) , 
by=c("Location" = "Primary_Thorfare"))


#        Month Longitude Latitude             Location            Crime.type Postcode
#   1  2015-01 -6.003289 54.55165      SALISBURY PLACE Anti-social behaviour     <NA>
#   2  2015-01 -5.707979 54.59231                 <NA> Anti-social behaviour     <NA>
#   3  2015-01 -5.815976 54.73161        MILEBUSH PARK Anti-social behaviour     <NA>
#   4  2015-01 -6.393411 54.19788 COLLEGE SQUARE NORTH Anti-social behaviour     <NA>
#   5  2015-01 -6.251798 54.85970         STAFFA DRIVE Anti-social behaviour     <NA>
#   6  2015-01 -7.206893 54.62265    KILLYCLOGHER ROAD Anti-social behaviour     <NA>
#   7  2015-01 -5.915793 54.59242      RAVENHILL REACH Anti-social behaviour     <NA>
#   8  2015-01 -5.535389 54.48792                 <NA> Anti-social behaviour     <NA>
#   9  2015-01 -7.322812 54.99940   GREAT JAMES STREET Anti-social behaviour     <NA>
#   10 2015-01 -5.954670 54.61568         JAMAICA ROAD Anti-social behaviour  BT568DY
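
One caveat with a plain join: when a street appears on several rows of the postcode table, left_join returns one output row per matching row, so the result can end up with more rows than the crime data. A minimal sketch of one way around this is to reduce the postcode table to a single most frequent postcode per street before joining. The data frames crimes and postcodes and their values are hypothetical stand-ins, not the real data, and ties are broken by whichever postcode sorts first.

```r
library(dplyr)

# Toy stand-ins for the two data frames (hypothetical values, not the real data)
crimes <- data.frame(
  Location = c("MILEBUSH PARK", "JAMAICA ROAD", NA, "STAFFA DRIVE"),
  stringsAsFactors = FALSE
)
postcodes <- data.frame(
  Primary_Thorfare = c("MILEBUSH PARK", "MILEBUSH PARK", "MILEBUSH PARK", "JAMAICA ROAD"),
  Postcode = c("BT387QR", "BT387PU", "BT387QR", "BT568DY"),
  stringsAsFactors = FALSE
)

# Build a lookup with exactly one row per street:
# count each street/postcode pair, sort by descending count,
# then keep the first (most frequent) postcode for every street.
lookup <- postcodes %>%
  filter(!is.na(Primary_Thorfare), !is.na(Postcode)) %>%
  count(Primary_Thorfare, Postcode) %>%
  arrange(Primary_Thorfare, desc(n)) %>%
  distinct(Primary_Thorfare, .keep_all = TRUE) %>%
  select(Primary_Thorfare, Postcode)

# The lookup keys are unique, so the join keeps one row per crime record.
result <- crimes %>%
  left_join(lookup, by = c("Location" = "Primary_Thorfare"))

print(result)
```

Streets with no entry in the lookup, and rows with a missing Location, come back with NA in Postcode.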

If I have to keep the Postcode lookup logic the OP mentioned, the solution can be written as:

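AllNICrimeData$newcol <- vapply(AllNICrimeData$Location, function(x) {
  tab <- table(CleanNIPostcodeData$Postcode[which(x == CleanNIPostcodeData$Primary_Thorfare)])
  if (length(tab) > 0) names(which.max(tab)) else NA_character_  # NA when no street matches
}, character(1), USE.NAMES = FALSE)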
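
The per-row lookup above rescans the whole postcode table once for every crime record, which may be slow at the sizes mentioned in the question (500,000 rows against 900,000). A possible base R alternative, sketched here on hypothetical toy data (crimes, postcodes and the helper most_common are illustrative names), is to compute the most frequent postcode once per distinct street with tapply and then pick the values up with match.

```r
# Toy stand-ins for the two data frames (hypothetical values, not the real data)
crimes <- data.frame(
  Location = c("MILEBUSH PARK", NA, "JAMAICA ROAD", "STAFFA DRIVE"),
  stringsAsFactors = FALSE
)
postcodes <- data.frame(
  Primary_Thorfare = c("MILEBUSH PARK", "MILEBUSH PARK", "MILEBUSH PARK", "JAMAICA ROAD"),
  Postcode = c("BT387QR", "BT387PU", "BT387QR", "BT568DY"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent value of x, or NA if x has no non-missing values
most_common <- function(x) {
  tab <- table(x)
  if (length(tab) == 0) NA_character_ else names(which.max(tab))
}

# One pass over the postcode table: a named result with one postcode per street
lookup <- tapply(postcodes$Postcode, postcodes$Primary_Thorfare, most_common)

# match() gives NA for streets that are missing or absent from the lookup,
# so those rows receive NA in the new column.
crimes$Postcode <- as.vector(lookup)[match(crimes$Location, names(lookup))]

print(crimes)
```

Because which.max returns the first maximum and table sorts its names, a tie between postcodes is resolved in favour of the one that sorts first.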

Data:

AllNICrimeData <- read.table(text = "Month Longitude Latitude Location Crime.type
1 2015-01 -6.003289 54.55165 ' SALISBURY PLACE' 'Anti-social behaviour'
2 2015-01 -5.707979 54.59231 NA 'Anti-social behaviour'
3 2015-01 -5.815976 54.73161 'MILEBUSH PARK' 'Anti-social behaviour'
4 2015-01 -6.393411 54.19788 'COLLEGE SQUARE NORTH' 'Anti-social behaviour'
5 2015-01 -6.251798 54.85970 'STAFFA DRIVE' 'Anti-social behaviour'
6 2015-01 -7.206893 54.62265 'KILLYCLOGHER ROAD' 'Anti-social behaviour'
7 2015-01 -5.915793 54.59242 'RAVENHILL REACH' 'Anti-social behaviour'
8 2015-01 -5.535389 54.48792 NA 'Anti-social behaviour'
9 2015-01 -7.322812 54.99940 'GREAT JAMES STREET' 'Anti-social behaviour'
10 2015-01 -5.954670 54.61568 'JAMAICA ROAD' 'Anti-social behaviour'",
header = TRUE, stringsAsFactors = FALSE)

CleanNIPostcodeData <- read.table(text = "Number Primary_Thorfare Alt_Thorfare Secondary_Thorfare Locality Townland Town County Postcode
1 134 'WHITEPARK ROAD' <NA> <NA> BALLINTOY 'BALLINTOY DEMESNE' BALLYCASTLE ANTRIM BT546ND
2 27 'PRINCESS STREET' <NA> <NA> <NA> 'PORT RUSH' PORTRUSH ANTRIM BT568AX
3 <NA> 'COVEHILL COURT' <NA> <NA> <NA> GLENAMANUS PORTRUSH ANTRIM BT568GL
4 271 'OLDPARK ROAD' <NA> <NA> <NA> 'TOWN PARKS' BELFAST ANTRIM BT146QR
5 2A 'RAMORE STREET' <NA> <NA> <NA> 'PORT RUSH' PORTRUSH ANTRIM BT568BD
6 52 'JAMAICA ROAD' <NA> <NA> <NA> 'PORT RUSH' PORTRUSH ANTRIM BT568DY",
header = TRUE, stringsAsFactors = FALSE)

R - Match street names with postcodes from a different df and apply the result to a new column
