It's hard to describe, but basically I'm trying to find a general way to go from this:
[1]" On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants I had a business dinner at this restaurant with 5 other people. Everyone was pleased with their appetizers and main courses. We’ll be back for sure…"
[2]" Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi BarsWebsiteMenuOrder Online"
to this:
[1] "95 E Kennedy Blvd"
[2] "231 3rd St"
using R. I know it involves regular expressions, but I'm not as fluent with them as I'd like to be.
Thanks!
Answer 0 (score: 2)
Your expected output doesn't follow a very solid rule, but judging from the expected data, you can do what you're after with this regex,
^.*?(\d{2,}.*?[a-z])[A-Z].*
and replace the match with \1, since group 1 captures the text you want.
sub("^.*?(\\d{2,}.*?[a-z])[A-Z].*", "\\1", "On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants I had a business dinner at this restaurant with 5 other people. Everyone was pleased with their appetizers and main courses. We’ll be back for sure…")
sub("^.*?(\\d{2,}.*?[a-z])[A-Z].*", "\\1", "Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi BarsWebsiteMenuOrder Online")
which prints, as expected,
[1] "95 E Kennedy Blvd"
[1] "231 3rd St"
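Because sub() is vectorized, the same pattern can be applied to a whole character vector in one call. The following is only a minimal sketch under a few assumptions: the vector name x, the shortened sample strings, the extra non-matching string, and the helper extract_address are all illustrative and not part of the original answer. It also passes perl = TRUE (an assumption made here so that the lazy quantifiers follow PCRE rules), and it returns NA when nothing matches, since sub() otherwise hands back the input unchanged.

```r
# Hypothetical input: shortened versions of the sample strings,
# plus one string without an address
x <- c(
  "On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants",
  "Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi Bars",
  "no address here"
)

pattern <- "^.*?(\\d{2,}.*?[a-z])[A-Z].*"

# Illustrative helper: sub() leaves non-matching strings untouched,
# so report those as NA instead of returning the raw input.
extract_address <- function(strings, pattern) {
  matched <- grepl(pattern, strings, perl = TRUE)
  ifelse(matched, sub(pattern, "\\1", strings, perl = TRUE), NA_character_)
}

addresses <- extract_address(x, pattern)
print(addresses)
```

The NA for the third string makes it easy to spot rows where the heuristic did not apply.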
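Edit:
OK, \d{2,} may be too specific to this data, so here is a different approach: start capturing at one or more digits (\d+) followed by one or more whitespace characters (\s+). Since the match should stop right before Lakewood, the regex also uses a positive lookahead, (?=Lakewood). Note that base R's sub() only supports lookaheads when called with perl = TRUE. The updated, better regex is this: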
^.*?(\d+\s+.*?)(?=Lakewood).*
Now, if you prefer, you can even use str_match with the regex \d+\s+.*?(?=Lakewood) to extract the text directly, using the following lines of code,
library(stringr)
str_match("On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants I had a business dinner at this restaurant with 5 other people. Everyone was pleased with their appetizers and main courses. We’ll be back for sure…", "\\d+\\s+.*?(?=Lakewood)")
str_match("Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi BarsWebsiteMenuOrder Online", "\\d+\\s+.*?(?=Lakewood)")
which prints
     [,1]
[1,] "95 E Kennedy Blvd"
     [,1]
[1,] "231 3rd St"
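If stringr is not available, the lookahead-based extraction can probably be reproduced in base R as well. This is a sketch rather than the answerer's own code: the vector name x and the shortened sample strings are assumptions. The key detail is perl = TRUE, because base R's default regex engine does not support lookaheads such as (?=Lakewood).

```r
# Hypothetical input: shortened versions of the sample strings
x <- c(
  "On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants",
  "Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi Bars"
)

# Lookaheads require the PCRE engine, hence perl = TRUE below
pattern <- "\\d+\\s+.*?(?=Lakewood)"

# regexpr() locates the first match in each string; regmatches() extracts it
matches <- regmatches(x, regexpr(pattern, x, perl = TRUE))
print(matches)

# The anchored form of the regex can be used with sub() in the same way
replaced <- sub("^.*?(\\d+\\s+.*?)(?=Lakewood).*", "\\1", x, perl = TRUE)
print(replaced)
```

One difference worth keeping in mind: regmatches() silently drops strings that do not match, while sub() returns them unchanged, so the two results can differ in length on messier data.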
Answer 1 (score: 1)
Pushpesh Kumar Rajwanshi's answer is very good and quite general. However, in case it helps, here is an alternative:
x <- c(" On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants I had a business dinner at this restaurant with 5 other people. Everyone was pleased with their appetizers and main courses. We’ll be back for sure…",
" Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi BarsWebsiteMenuOrder Online")
street_types <- c("Blvd", "St")
address_pattern <- paste("\\d+ .+?", street_types, collapse = "|")
stringr::str_extract_all(string = x, pattern = address_pattern, simplify = TRUE)
# [,1]
# [1,] "95 E Kennedy Blvd"
# [2,] "231 3rd St"
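This handles 1-digit address numbers and lets you specify the street types, which can help guard against other kinds of false positives (although if you don't list the street types exhaustively, some addresses may be missed).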
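One possible way to keep the street-type list easy to extend is to build a single grouped alternation, so the number prefix is written only once. This is a sketch under assumptions: the types "Ave" and "Rd" are illustrative additions that do not occur in the question's data, the sample strings are shortened, and base R with perl = TRUE is used here in place of stringr.

```r
# Hypothetical input: shortened versions of the sample strings
x <- c(
  "On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants",
  "Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi Bars"
)

# "Ave" and "Rd" are hypothetical additions to show the list being extended
street_types <- c("Blvd", "St", "Ave", "Rd")

# Builds "\\d+ .+? (?:Blvd|St|Ave|Rd)":
# a number, a space, lazily matched text, a space, then one street type
address_pattern <- paste0(
  "\\d+ .+? (?:", paste(street_types, collapse = "|"), ")"
)

# Extract the first match from each string
addresses <- regmatches(x, regexpr(address_pattern, x, perl = TRUE))
print(addresses)
```

Because the city name is glued directly onto the street type in this data ("StLakewood"), a word boundary cannot be added after the type, so a short type such as "St" would also match the beginning of a word like "Street".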
Answer 2 (score: 1)