How to extract a number from a string, including all text before the number

Asked: 2019-02-19 09:48:26

Tags: r regex

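I have a list of addresses, each containing (1) a house number and (2) a building name. I want to split each string into two columns. The tricky part is that some house numbers contain letters, e.g. 221B Baker Street.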

Example:

add <- c("5 Ark Royal House" , 
     "22A Blington Garden Lincoln Street", 
     "Flat 19 PICTON HOUSE" , 
     "2-3 Royal Albert Court" , 
     "Room 1 Grand Hall", 
     "No 17 The Dell Alpha House")

The desired result is:

aim <- data.frame("No"=as.character(c("5", "22A", "Flat 19", "2-3", "Room 1", "No 17")), 
              "Building" = as.character(c("Ark Royal House", 
                                          "Blington Garden Lincoln Street" , 
                                          "PICTON HOUSE", 
                                          "Royal Albert Court" , 
                                          "Grand Hall" , 
                                          "The Dell Alpha House")))

2 answers:

Answer 0: (score: 3)

Using stringr:

library(stringr)
lst <- str_match_all(add, "^(\\D*\\d[-\\w]*)\\s+(.+)")

(aim <- setNames(as.data.frame(do.call(rbind, lst)),
                c("all", "No", "Building")))

Or in base R:

pattern <- "^(\\D*\\d[-\\w]*)\\s+(.+)"
lst <- regmatches(add, regexec(pattern, add, perl = TRUE))
(aim <- setNames(as.data.frame(do.call(rbind, lst)),
                 c("all", "No", "Building")))


Both yield

                                 all      No                       Building
1                  5 Ark Royal House       5                Ark Royal House
2 22A Blington Garden Lincoln Street     22A Blington Garden Lincoln Street
3               Flat 19 PICTON HOUSE Flat 19                   PICTON HOUSE
4             2-3 Royal Albert Court     2-3             Royal Albert Court
5                  Room 1 Grand Hall  Room 1                     Grand Hall
6         No 17 The Dell Alpha House   No 17           The Dell Alpha House

See a demo for the expression on regex101.com.
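To make the pieces of the pattern easier to follow, here is a minimal sketch that wraps the same expression in a helper and keeps only the two capture groups. The function name `split_address` and the choice to return `NA` for strings without any digit are assumptions made for illustration; they are not part of the answer above.

```r
# Hypothetical helper (not from the answer): same pattern, but only the
# two capture groups are returned.
split_address <- function(x) {
  # ^(\D*\d[-\w]*)  optional non-digit prefix, one digit, then any run of
  #                 word characters or hyphens ("Flat 19", "22A", "2-3")
  # \s+             whitespace between the two parts
  # (.+)            the rest of the string, i.e. the building name
  pattern <- "^(\\D*\\d[-\\w]*)\\s+(.+)"
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # A string that does not match yields character(0); map it to NA (assumption).
  rows <- lapply(m, function(r) {
    if (length(r) == 3) r[2:3] else c(NA_character_, NA_character_)
  })
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- c("No", "Building")
  out
}

res <- split_address(c("22A Blington Garden Lincoln Street",
                       "Flat 19 PICTON HOUSE",
                       "Ark Royal House"))  # no digit: does not match
res
```

The third input is a made-up case with no house number; with this sketch it comes back as a row of `NA` values rather than being silently dropped by `rbind`.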

Answer 1: (score: 1)

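A basic approach: find the gap between the number and the name, replace it with a neutral character (here `_`, but it can be any character you know does not occur in the addresses), then split on that character.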

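It assumes that the last "word" containing a digit marks the end of the "No" part. If that does not hold for all of your addresses (it does for all of your test cases), this will not work.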

add <- c("5 Ark Royal House" , 
  "22A Blington Garden Lincoln Street", 
  "Flat 19 PICTON HOUSE" , 
  "2-3 Royal Albert Court" , 
  "Room 1 Grand Hall", 
  "No 17 The Dell Alpha House")

# [A-Za-z] instead of [A-z]: in ASCII order, A-z also spans [ \ ] ^ _ and `
split_add <- strsplit(gsub('([0-9\\-]+[0-9A-Za-z]*) ', '\\1_', add), split='_')

aim <- setNames(as.data.frame(do.call(rbind, split_add)),
  c('No', 'Building'))

aim
#>        No                       Building
#> 1       5                Ark Royal House
#> 2     22A Blington Garden Lincoln Street
#> 3 Flat 19                   PICTON HOUSE
#> 4     2-3             Royal Albert Court
#> 5  Room 1                     Grand Hall
#> 6   No 17           The Dell Alpha House

Created on 2019-02-19 by the reprex package (v0.2.1)
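The caveat about words containing digits can be seen with a made-up address that has two numbers. The sketch below uses a slight variant of the pattern (letters written as `A-Za-z`, hyphen placed last in the class); the input string and the `sub()` alternative are assumptions for illustration, not part of the answer.

```r
pattern <- "([0-9-]+[0-9A-Za-z]*) "
tricky  <- "Flat 2 Block 7 House"   # hypothetical address with two numbers

# gsub() replaces every match, so two separators are inserted
# and the string falls apart into three pieces.
parts_all <- strsplit(gsub(pattern, "\\1_", tricky), split = "_")[[1]]

# sub() replaces only the first match, so exactly two pieces remain.
parts_first <- strsplit(sub(pattern, "\\1_", tricky), split = "_")[[1]]

parts_all
parts_first
```

Whether splitting at the first number is the right behaviour depends on the data; it is only one possible way to keep every row at two columns.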