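I have a column of city, state data in a data frame. I only need to extract the state abbreviation and store it in a new column named state.
From visual inspection, the state is always the last two characters of the string, and both are uppercase. The city, state data looks like this: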
test <- c("Anchorage, AK", "New York City, NY", "Some Place, Another Place, LA")
I tried the following:
pattern <- "[, (A-Z){2}]"
strsplit(test, pattern)
The output is:
[[1]]
[1] "Anchorage, "
[[2]]
[1] "New York City, "
[[3]]
[1] "Some Place, Another Place, "
EDIT: I then tried a different regular expression:
pattern2 <- "([a-z, ])"
sp <- strsplit(test, pattern2)
I got these results:
[[1]]
[1] "A" "" "" "" "" "" "" "" "" "" "AK"
[[2]]
[1] "N" "" "" "Y" "" "" "" "C" "" "" "" "" "NY"
[[3]]
[1] "S" "" "" "" "P" "" "" "" "" "" "A" "" "" "" "" "" ""
[18] "P" "" "" "" "" "" "LA"
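So the abbreviations are there, but when I try to extract them with sapply(),
I don't know how to get the last element of each list item. I know how to get the first: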
sapply(sp, "[[", 1)
Answer 0: (score: 4)
I'm not sure you really need a regular expression. If you always want just the last two characters of the string, use
substring(test, nchar(test)-1, nchar(test))
[1] "AK" "NY" "LA"
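If you really insist on a regular expression, at least consider using regexec
instead of strsplit,
since you aren't interested in splitting; you just want to extract the state.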
m <- regexec("[A-Z]+$", test)
unlist(regmatches(test,m))
# [1] "AK" "NY" "LA"
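Since the question asks for the result to end up in a new column named state, here is a minimal sketch of that last step. The data frame `df` and the column name `city_state` are hypothetical names chosen for illustration; they do not appear in the question.

```r
# Sample data from the question
test <- c("Anchorage, AK", "New York City, NY", "Some Place, Another Place, LA")

# Hypothetical data frame and column name, for illustration only
df <- data.frame(city_state = test, stringsAsFactors = FALSE)

# Take the last two characters of each string
n <- nchar(df$city_state)
df$state <- substring(df$city_state, n - 1, n)

df$state
```

This relies on the assumption stated in the question that the abbreviation is always the final two characters, so trailing whitespace in the real data would need to be trimmed first (for example with trimws()).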
Answer 1: (score: 1)
This works:
regmatches(test, gregexpr("(?<=[,][\\s+])([A-Z]{2})", test, perl = TRUE))
## [[1]]
## [1] "AK"
##
## [[2]]
## [1] "NY"
##
## [[3]]
## [1] "LA"
Explanation courtesy of: http://liveforfaith.com/re/explain.pl
(?<=          look behind to see if there is:
  [,]           any character of: ','
  [\\s+]        any character of: whitespace (\n, \r, \t, \f, and " "), '+'
)             end of look-behind
(             group and capture to \1:
  [A-Z]{2}      any character of: 'A' to 'Z' (2 times)
)             end of \1
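Because gregexpr() returns a list, the result above needs one more step before it can go into a column. A possible variant, sketched here as an assumption rather than as the answerer's method, uses regexpr() (at most one match per string) and anchors the pattern at the end of the string with `$`:

```r
# Sample data from the question
test <- c("Anchorage, AK", "New York City, NY", "Some Place, Another Place, LA")

# Look behind for ", " and require the two capitals to end the string.
# regexpr() reports at most one match per element, so regmatches()
# returns a plain character vector instead of a list.
m <- regexpr("(?<=, )[A-Z]{2}$", test, perl = TRUE)
state <- regmatches(test, m)

state
```

Note that regmatches() silently drops elements with no match, so the result can be shorter than the input if some rows do not follow the "City, ST" layout.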
Answer 2: (score: 1)
Try:
tt = strsplit(test, ', ')
tt
[[1]]
[1] "Anchorage" "AK"
[[2]]
[1] "New York City" "NY"
[[3]]
[1] "Some Place" "Another Place" "LA"
z = list()
for(i in tt) z[length(z)+1] = i[length(i)]
z
[[1]]
[1] "AK"
[[2]]
[1] "NY"
[[3]]
[1] "LA"
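The loop above can also be written with vapply(), which returns a character vector directly and addresses the original question of how to pick the last element of each list item. This is a sketch of one idiomatic alternative; the helper name `last_part` is made up for illustration.

```r
# Sample data from the question
test <- c("Anchorage, AK", "New York City, NY", "Some Place, Another Place, LA")

tt <- strsplit(test, ", ")

# Hypothetical helper: return the final element of a character vector
last_part <- function(parts) parts[length(parts)]

# vapply() checks that every result is a single string
state <- vapply(tt, last_part, character(1))

state
```

The same indexing works inside sapply(), for example sapply(tt, function(x) x[length(x)]); vapply() is used here only because it fails loudly if an element is empty.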
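Answer 3: (score: 0)
I think you have the meanings of '[]' and '()' reversed. '()' matches a group of characters; '[]' matches any single character from a class. What you need is
"(, [A-Z]{2})".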
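To illustrate the difference between a group and a character class, here is a small sketch that uses a parenthesised group together with sub() and a backreference. The exact pattern is my own assumption built from the data in the question, not something taken from this answer.

```r
# Sample data from the question
test <- c("Anchorage, AK", "New York City, NY", "Some Place, Another Place, LA")

# [A-Z] is a class: any ONE uppercase letter; {2} repeats it twice.
# ( ... ) is a group: it captures the two letters so "\\1" can refer to them.
# The greedy ".*, " consumes everything up to the last ", " in the string.
state <- sub(".*, ([A-Z]{2})$", "\\1", test)

state
```

If a string does not match the pattern, sub() returns it unchanged, so unexpected rows stay visible instead of disappearing.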
Answer 4: (score: 0)
library(stringr)
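# perl() was deprecated in stringr 1.0.0 in favour of regex(); a plain pattern string is treated as a regex
str_extract(test, '[A-Z]+(?=\\b$)')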
#[1] "AK" "NY" "LA"
Answer 5: (score: 0)
Here is a regular expression for the same task.
Regex
(?'state'\w{2})(?=")
Test string
"Anchorage, AK", "New York City, NY", "Some Place, Another Place, LA"
Result
AK
NY
LA
If you like, you can drop the named capture to make the pattern shorter,
for example
(\w{2})(?=")
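The patterns in this answer look ahead for a closing double quote, which is part of the test string as typed but not of the values in the R vector. A rough R adaptation, assuming the look-ahead is swapped for an end-of-string anchor `$`, could read the named group through the capture attributes that regexpr() sets when perl = TRUE:

```r
# Sample data from the question
test <- c("Anchorage, AK", "New York City, NY", "Some Place, Another Place, LA")

# Named group 'state'; "$" replaces the (?=") look-ahead, since the
# R strings themselves contain no quote characters (an assumption of this sketch)
m <- regexpr("(?<state>\\w{2})$", test, perl = TRUE)

# With perl = TRUE, regexpr() attaches capture positions as matrices;
# column 1 corresponds to the first (and only) capture group
starts <- attr(m, "capture.start")[, 1]
lens <- attr(m, "capture.length")[, 1]
state <- substring(test, starts, starts + lens - 1)

state
```

For a single group this is more machinery than substring() or sub() need; it becomes useful mainly when a pattern has several named groups.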