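Is there a general way in R to pull out a substring that starts with a number and ends just before an uppercase letter?

Asked: 2019-05-02 16:37:20

Tags: r regex gsub

It's hard to describe, but basically I'm trying to find a general way to go from this: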

    [1]" On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants I had a business dinner at this restaurant with 5 other people. Everyone was pleased with their appetizers and main courses. We’ll be back for sure…" 
    [2]" Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi BarsWebsiteMenuOrder Online"

to this:

    [1] "95 E Kennedy Blvd"
    [2] "231 3rd St"

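using R. I know it involves regular expressions, but I'm not as fluent in them as I'd like to be.

Thanks!

3 answers:

Answer 0 (score: 2)

There isn't a very solid rule behind your expected output, but judging from the sample data, you can get what you're after with this regex,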

    ^.*?(\d{2,}.*?[a-z])[A-Z].*

and replace the match with \1, since group 1 captures the text you want.

    sub("^.*?(\\d{2,}.*?[a-z])[A-Z].*", "\\1", "On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants I had a business dinner at this restaurant with 5 other people. Everyone was pleased with their appetizers and main courses. We’ll be back for sure…")
    sub("^.*?(\\d{2,}.*?[a-z])[A-Z].*", "\\1", "Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi BarsWebsiteMenuOrder Online")

which prints, as expected,

    [1] "95 E Kennedy Blvd"
    [1] "231 3rd St"
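
Because `sub` is vectorized over its input, the same pattern can be applied to several strings in one call. A minimal sketch, assuming the inputs live in a character vector named `x` (the name and the shortened sample strings are assumptions made for illustration):

```r
# Hypothetical input vector: the two sample strings, cut off after the first category.
x <- c("On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants",
       "Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi Bars")

# Group 1 starts at the first run of two or more digits and ends at the first
# lowercase letter that is directly followed by an uppercase letter.
addresses <- sub("^.*?(\\d{2,}.*?[a-z])[A-Z].*", "\\1", x)
print(addresses)
```

Note that `sub` returns an element unchanged when the pattern does not match it, so unmatched rows come back as the full original string rather than `NA`.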

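Edit: OK, \d{2,} may be too tied to this particular data (for example, it would skip a one-digit house number), so here is a different logic: start capturing at one or more digits, \d+, followed by one or more whitespace characters. Since the match should stop right before Lakewood, the regex also uses a positive lookahead, (?=Lakewood). The updated, better regex is this: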

    ^.*?(\d+\s+.*?)(?=Lakewood).*
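
One caveat if this pattern is used with base R's `sub`: a lookahead such as `(?=Lakewood)` is a Perl-style construct, so the call needs `perl = TRUE`; R's default regex engine does not support it. A sketch on shortened copies of the sample strings (the vector name `x` is an assumption):

```r
# Hypothetical input vector: shortened copies of the two sample strings.
x <- c("On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants",
       "Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi Bars")

# perl = TRUE switches to the PCRE engine, which understands (?=...).
street <- sub("^.*?(\\d+\\s+.*?)(?=Lakewood).*", "\\1", x, perl = TRUE)
print(street)
```

The lookahead hard-codes the city name, so this only works for addresses that are immediately followed by "Lakewood".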

Now, if you prefer, you can even extract the text directly with str_match and the regex \d+\s+.*?(?=Lakewood), using the following lines of code,

    library(stringr)

    str_match("On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants I had a business dinner at this restaurant with 5 other people. Everyone was pleased with their appetizers and main courses. We’ll be back for sure…", "\\d+\\s+.*?(?=Lakewood)")
    str_match("Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi BarsWebsiteMenuOrder Online", "\\d+\\s+.*?(?=Lakewood)")

which prints

         [,1]               
    [1,] "95 E Kennedy Blvd"
         [,1]        
    [1,] "231 3rd St"
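
`str_match` returns a character matrix with one row per input. If a plain character vector is more convenient and you'd rather not load stringr, one possible base R equivalent is `regexpr` together with `regmatches`. This is only a sketch; the vector name `x` and the shortened strings are assumptions:

```r
# Hypothetical input vector: shortened copies of the two sample strings.
x <- c("On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants",
       "Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi Bars")

# regexpr() locates the first match in each element; perl = TRUE is needed for the lookahead.
m <- regexpr("\\d+\\s+.*?(?=Lakewood)", x, perl = TRUE)

# regmatches() pulls out the matched text as a character vector.
street <- regmatches(x, m)
print(street)
```

Keep in mind that `regmatches` silently drops elements that have no match, so the result can be shorter than `x`.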

Answer 1 (score: 1)

Pushpesh Kumar Rajwanshi's answer is good and quite general. Still, in case you find it helpful, here is an alternative approach:

    # The two sample strings from the question
    x <- c(" On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants I had a business dinner at this restaurant with 5 other people. Everyone was pleased with their appetizers and main courses. We’ll be back for sure…",
           " Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi BarsWebsiteMenuOrder Online")
    # Street-type suffixes that mark the end of an address
    street_types <- c("Blvd", "St")
    # One alternative per street type; paste() joins prefix and suffix with a space
    address_pattern <- paste("\\d+ .+?", street_types, collapse = "|")
    stringr::str_extract_all(string = x, pattern = address_pattern, simplify = TRUE)
    #      [,1]               
    # [1,] "95 E Kennedy Blvd"
    # [2,] "231 3rd St" 

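This handles one-digit address numbers and lets you specify the street types, which can help guard against other kinds of false positives (although you may get some false negatives if you don't list the street types exhaustively).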
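
As the list of street types grows, it may be tidier to put the suffixes into a single non-capturing group instead of repeating the `\d+ .+?` prefix for each one. A base R sketch of that variation; the extra suffixes "Ave" and "Rd" and the shortened strings are illustrative assumptions, not part of the original data:

```r
# Hypothetical input vector: shortened copies of the two sample strings.
x <- c(" On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants",
       " Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi Bars")

# "Ave" and "Rd" are made-up additions to show how the list would be extended.
street_types <- c("Blvd", "St", "Ave", "Rd")

# Builds: \d+ .+? (?:Blvd|St|Ave|Rd)
address_pattern <- paste0("\\d+ .+? (?:", paste(street_types, collapse = "|"), ")")

# First match per element, returned as a character vector.
m <- regexpr(address_pattern, x, perl = TRUE)
addresses <- regmatches(x, m)
print(addresses)
```

A short suffix such as "St" also matches the start of longer words like "Street", so the match would stop early on a spelled-out street type; whether that matters depends on the data.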

Answer 2 (score: 1)

This approach works well:

    (\[\d])(?:.+[^\s\d])((?:\d+\s+)[^\R]+)

Geshmak!