I have this sample data frame:
address <- c("11537 W LARKSPUR RD EL MIRAGE 85335", "6702 E CPT DREYFUS SCOTTSDALE 85254", "114 S PUEBLO ST GILBERT 85233", "16981 W YOUNG ST SURPRISE 85388")
person <- c("Maria", "Jose", "Adan", "Eva")
my_address <- tibble::tibble(person, address)
I need to extract the city from the address column. A city can consist of 1 or 2 words, but it always sits right before the 5-digit ZIP code.
From the data frame, I want to get "EL MIRAGE", "SCOTTSDALE" and "GILBERT" in a new column, city.
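Important:
The city always comes after a 2- or 3-letter word, for example: ST, AVE, RD.
For example, from "16981 W YOUNG ST SURPRISE 85388" I want to get "SURPRISE", which comes after "ST".
So, I am trying this regular expression: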
my_address$city <-gsub("(.*)([a-zA-Z])([0-9]{5})(.*)", "\\2", my_address$address)
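But it returns all the text in the column instead of the desired city. Also, I notice that I have not told it to check for 1 or 2 words before the 5 digits, so would it only extract 1 word?
Update 1: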
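A minimal sketch of why the `gsub` call above leaves the text untouched, assuming the sample addresses from the question: the pattern requires a letter immediately followed by five digits, but every address has a space before the ZIP code, so nothing matches and `gsub` returns its input unchanged. The second pattern is only an illustration, not the asker's code; it captures a single word before the ZIP code, which is why a one-word capture cannot recover "EL MIRAGE".

```r
address <- c("11537 W LARKSPUR RD EL MIRAGE 85335", "6702 E CPT DREYFUS SCOTTSDALE 85254",
             "114 S PUEBLO ST GILBERT 85233", "16981 W YOUNG ST SURPRISE 85388")

# The original pattern needs a letter directly followed by 5 digits;
# none of the addresses contains that, so there is nothing to replace
matched <- grepl("(.*)([a-zA-Z])([0-9]{5})(.*)", address)

# Illustrative pattern: allow the space and capture the single word before the ZIP code
last_word <- sub("^.* ([A-Za-z]+) [0-9]{5}$", "\\1", address)
```

Here `matched` is `FALSE` for all four addresses, and `last_word` holds only the final word of each city ("MIRAGE" rather than "EL MIRAGE"), so two-word cities need an extra rule.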
string1 <- "114 S PUEBLO ST GILBERT 85233"
sapply(stringr::str_extract_all(string1,"\\w{4,}"),"[",3)
This returns 85233, but it should be GILBERT.
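A short sketch of what happens in the update, assuming only `stringr`: `\\w` also matches digits, so the ZIP code counts as a word of four or more characters and becomes the third match. Taking the second-to-last whitespace-separated word with `stringr::word()` is one possible alternative, though it still returns just one word of a two-word city.

```r
library(stringr)

string1 <- "114 S PUEBLO ST GILBERT 85233"

# All runs of 4+ word characters; digits count as word characters too
matches <- str_extract_all(string1, "\\w{4,}")[[1]]

# Second-to-last word, counting from the end of the string
city_word <- word(string1, -2)
```

With this input, `matches` contains "PUEBLO", "GILBERT" and "85233" in that order, and `city_word` is "GILBERT".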
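Answer 0 (score: 2)
I would generally have preferred a one-liner. This seems overcomplicated and needs an additional step to remove the "ST" before "SURPRISE". Assuming every such prefix is "ST", it gets the job done.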
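library(stringr)
# Grab every "word of 2+ characters, space, word of 3+ characters" pair from each address
new_s<-unlist(str_extract_all(my_address$address,"\\w{2,} \\w{3,}"))
# Blank out pairs that start with 3 word characters and end in a non-digit (e.g. "CPT DREYFUS")
newer_s<-str_remove_all(new_s,"^\\w{3}.*\\D$")
# Strip the trailing ZIP code (e.g. "SCOTTSDALE 85254" becomes "SCOTTSDALE")
newer_s<-str_remove_all(newer_s,"\\s.*\\d")
# Drop a leading "ST "
res<-str_remove_all(newer_s,"^ST ")
# Turn empty strings into NA and keep only the non-missing values
res[res==""]<-NA
my_address$city<-res[complete.cases(res)]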
Result:
my_address
# A tibble: 4 x 3
#  person address                             city
#  <chr>  <chr>                               <chr>
#1 Maria  11537 W LARKSPUR RD EL MIRAGE 85335 EL MIRAGE
#2 Jose   6702 E CPT DREYFUS SCOTTSDALE 85254 SCOTTSDALE
#3 Peter  16981 W YOUNG ST SURPRISE 85388     SURPRISE
#4 Paul   114 S PUEBLO ST GILBERT 85233       GILBERT
Data:
address <- c("11537 W LARKSPUR RD EL MIRAGE 85335", "6702 E CPT DREYFUS SCOTTSDALE 85254",
"16981 W YOUNG ST SURPRISE 85388","114 S PUEBLO ST GILBERT 85233")
person <- c("Maria", "Jose","Peter","Paul")
my_address <- tibble::tibble(person, address)
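Because `unlist()` flattens the matches from all addresses into one vector, the steps in this answer rely on exactly one fragment surviving per address. A minimal sketch of a row-wise variant follows, assuming the same patterns are kept as they are; the helper name `extract_city` is hypothetical, and it returns `NA` whenever an address does not yield exactly one candidate, so the result stays aligned with the rows.

```r
library(stringr)

address <- c("11537 W LARKSPUR RD EL MIRAGE 85335", "6702 E CPT DREYFUS SCOTTSDALE 85254",
             "16981 W YOUNG ST SURPRISE 85388", "114 S PUEBLO ST GILBERT 85233")

# Hypothetical helper: applies the answer's patterns to a single address
extract_city <- function(x) {
  parts <- str_extract_all(x, "\\w{2,} \\w{3,}")[[1]]
  parts <- str_remove_all(parts, "^\\w{3}.*\\D$")  # blank out e.g. "CPT DREYFUS"
  parts <- str_remove_all(parts, "\\s.*\\d")       # strip the ZIP code
  parts <- str_remove_all(parts, "^ST ")           # drop a leading "ST "
  parts <- parts[parts != ""]
  # Keep the row only when exactly one candidate is left
  if (length(parts) == 1) parts else NA_character_
}

city <- vapply(address, extract_city, character(1), USE.NAMES = FALSE)
```

For the four sample addresses this gives the same cities as the result table, one per row.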
Answer 1 (score: 2)
This dplyr + stringr / tidyverse solution relies on the fact that you know which 2-3 letter words come before the city...
# vector with 2-3 letter words before a city?
v.before <- c("ST", "RD", "AVE")
#with this vector, we can build an 'or'-pattern for a regex
library( dplyr )
library( stringr )
data.frame( person, address) %>%
mutate( place = stringr::str_extract( address, paste0("(?<=", paste0(v.before, collapse = " |" ), " ).*(?= [0-9]{5})") ) ) %>%
#no match found?, then the city is the second last word from address
mutate( place = ifelse( is.na( place ), stringr::word(address, -2), place))
#   person                             address      place
# 1  Maria 11537 W LARKSPUR RD EL MIRAGE 85335  EL MIRAGE
# 2   Jose 6702 E CPT DREYFUS SCOTTSDALE 85254 SCOTTSDALE
# 3   Adan       114 S PUEBLO ST GILBERT 85233    GILBERT
# 4    Eva     16981 W YOUNG ST SURPRISE 85388   SURPRISE
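One possible refinement, sketched under assumptions: the lookbehind above would also fire on a longer word that merely ends in "ST", "RD" or "AVE". The variant below adds a word boundary (`\\b`) and uses a capture group with `str_match()` instead of a lookbehind. The address in `hypothetical` is made up for illustration and is not part of the original data.

```r
library(stringr)

address <- c("11537 W LARKSPUR RD EL MIRAGE 85335", "6702 E CPT DREYFUS SCOTTSDALE 85254",
             "114 S PUEBLO ST GILBERT 85233", "16981 W YOUNG ST SURPRISE 85388")
hypothetical <- "100 FIRST AVE MESA 85201"  # made-up address, for illustration only

v.before <- c("ST", "RD", "AVE")

# \\b makes the suffix match only as a whole word; the city is capture group 1
pattern <- paste0("\\b(?:", paste(v.before, collapse = "|"), ") (.*) [0-9]{5}")

extract_place <- function(x) {
  place <- str_match(x, pattern)[, 2]
  # No suffix found: fall back to the second-to-last word, as in the answer
  ifelse(is.na(place), word(x, -2), place)
}

place <- extract_place(address)
place_hyp <- extract_place(hypothetical)
```

With the sample data, `place` matches the table above. For the made-up address, the word boundary keeps "FIRST" from being read as the suffix "ST", so `place_hyp` is "MESA" rather than "AVE MESA".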