I want to split a street address into two columns: one for the street number, the other for the street name

Time: 2018-06-01 15:21:47

Tags: r split

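I have a column df$addr that I want to split into two columns, df$str.num and df$str.name. Some values of df$addr contain a dash, which makes it hard to extract the street number (df$str.num) accurately. I have tried many solutions but could not get any of them right.

Any suggestions?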

       addr <- c("84-86 19th Ave",
                 "35 Halsey St",
                 "350 Broad St",
                 "997 S Orange Ave",
                 "274 Chestnut St",
                 "226 Lackawanna Ave",
                 "99 2nd Ave",
                 "261 S Orange Ave",
                 "357 Wilson Ave",
                 "402 Mount Prospect Ave # Lb2",
                 "380-2 Mount Prospect Ave",
                 "105 Lock St # 219",
                 "451 S 15th St")
       df <- data.frame(addr)

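3 answers:

Answer 0: (score: 2)

One option is to use tidyr::extract: the leading run of digits and dashes is captured as str.num, and everything after the first whitespace character as str.name.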

library(tidyr)

extract(df, addr, c("str.num", "str.name"), regex = "([[:digit:]-]+)\\s(.*)" )

#    str.num                 str.name
# 1    84-86                 19th Ave
# 2       35                Halsey St
# 3      350                 Broad St
# 4      997             S Orange Ave
# 5      274              Chestnut St
# 6      226           Lackawanna Ave
# 7       99                  2nd Ave
# 8      261             S Orange Ave
# 9      357               Wilson Ave
# 10     402 Mount Prospect Ave # Lb2
# 11   380-2       Mount Prospect Ave
# 12     105            Lock St # 219
# 13     451                S 15th St
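
By default, extract drops the input column. If the original addr column should stay next to the two new ones, the documented remove argument can be set to FALSE. The following is a minimal sketch on a shortened copy of the sample data; the anchors (^ and $) and the \\s+ separator are my additions, not part of the answer above.

```r
library(tidyr)

# Shortened copy of the sample data from the question
addr <- c("84-86 19th Ave", "35 Halsey St", "380-2 Mount Prospect Ave")
df <- data.frame(addr, stringsAsFactors = FALSE)

# remove = FALSE keeps the original addr column in the result
res <- tidyr::extract(
  df, addr,
  into   = c("str.num", "str.name"),
  regex  = "^([[:digit:]-]+)\\s+(.*)$",
  remove = FALSE
)
```

Here res contains addr plus the two extracted columns, so the split can be checked against the original strings.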

Answer 1: (score: 1)

Very similar to MKR's solution, but using stringr

library(stringr)
pat <- "(^[0-9-]+)[:space:]+([A-Za-z0-9].+)"
str_match(addr, pat)
      [,1]                           [,2]    [,3]                      
 [1,] "84-86 19th Ave"               "84-86" "19th Ave"                
 [2,] "35 Halsey St"                 "35"    "Halsey St"               
 [3,] "350 Broad St"                 "350"   "Broad St"                
 [4,] "997 S Orange Ave"             "997"   "S Orange Ave"            
 [5,] "274 Chestnut St"              "274"   "Chestnut St"             
 [6,] "226 Lackawanna Ave"           "226"   "Lackawanna Ave"          
 [7,] "99 2nd Ave"                   "99"    "2nd Ave"                 
 [8,] "261 S Orange Ave"             "261"   "S Orange Ave"            
 [9,] "357 Wilson Ave"               "357"   "Wilson Ave"              
[10,] "402 Mount Prospect Ave # Lb2" "402"   "Mount Prospect Ave # Lb2"
[11,] "380-2 Mount Prospect Ave"     "380-2" "Mount Prospect Ave"      
[12,] "105 Lock St # 219"            "105"   "Lock St # 219"           
[13,] "451 S 15th St"                "451"   "S 15th St" 

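I'm not sure how familiar you are with regular expressions, but it is worth noting that the parentheses () mark the capture groups we want to extract. In the matrix returned by str_match, column 1 holds the full match, column 2 the first group (the street number), and column 3 the second group (the street name).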
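
To get the two columns the question asks for, the group columns of the match matrix can be assigned back to the data frame. This is a minimal sketch on a shortened copy of the sample data; the pattern is a slightly simplified, anchored variant of the one above (my assumption, not the answerer's exact pattern).

```r
library(stringr)

# Shortened copy of the sample data from the question
addr <- c("84-86 19th Ave", "402 Mount Prospect Ave # Lb2", "380-2 Mount Prospect Ave")
df <- data.frame(addr, stringsAsFactors = FALSE)

# Group 1: leading digits and dashes; group 2: the rest after the whitespace
pat <- "^([0-9-]+)\\s+(.+)$"
m <- str_match(df$addr, pat)

df$str.num  <- m[, 2]  # first capture group
df$str.name <- m[, 3]  # second capture group
```

Rows that do not match the pattern come back as NA in both group columns, which makes unparsed addresses easy to spot.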

Answer 2: (score: 1)

With base R, we can do this using the sub function:

data.frame(str.num = sub(" .*", "", addr), str.name = sub("[0-9-]* ", "", addr))

#    str.num                 str.name
# 1    84-86                 19th Ave
# 2       35                Halsey St
# 3      350                 Broad St
# 4      997             S Orange Ave
# 5      274              Chestnut St
# 6      226           Lackawanna Ave
# 7       99                  2nd Ave
# 8      261             S Orange Ave
# 9      357               Wilson Ave
# 10     402 Mount Prospect Ave # Lb2
# 11   380-2       Mount Prospect Ave
# 12     105            Lock St # 219
# 13     451                S 15th St
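
The two sub calls above work on the sample data because every address starts with a number. The second pattern is not anchored, so on an address without a leading number, such as the hypothetical "Main St", it would delete the first space and return "MainSt". A possible sketch that guards against this, assuming such rows should get NA as the street number:

```r
# Two sample addresses plus one hypothetical address without a street number
addr <- c("84-86 19th Ave", "380-2 Mount Prospect Ave", "Main St")

# Anchored pattern: group 1 = leading digits/dashes, group 2 = the rest
pat <- "^([0-9-]+) +(.*)$"
hit <- grepl(pat, addr)

out <- data.frame(
  # NA where no leading number is found
  str.num  = ifelse(hit, sub(pat, "\\1", addr), NA_character_),
  # Leave the address unchanged where no leading number is found
  str.name = ifelse(hit, sub(pat, "\\2", addr), addr),
  stringsAsFactors = FALSE
)
```

Using one anchored pattern for both columns keeps the number and the name consistent with each other, at the cost of a slightly longer expression.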