I want to edit a string of addresses, for example:
test = c("[Mavlyanova, Nadira G.] Uzbek Acad Sci, GA Mavlyanov Inst Seismol, Tashkent 700135, Uzbekistan; [Markovic, Slobodan B.] Univ Novi Sad, Fac Sci, Chair Phys Geog, Novi Sad 21000, Serbia; [Rowell, G.] Univ Adelaide, Sch Chem & Phys, Adelaide, SA 5005, Australia; [Katarzynski, K.] Nicholas Copernicus Univ, Torun Ctr Astron, PL-87100 Torun, Poland; [Ansari, Z.; Boettcher, M.; Manschwetus, B.; Rottke, H.; Sandner, W.] Max Born Inst, D-12489 Berlin, Germany; [Milosevic, D. B.] Univ Sarajevo, Fac Sci, Sarajevo 71000, Bosnia & Herceg")
I want to get only the country names. This is what I have tried so far:
> testa <- gsub("\\[.*?\\] ", "", test) #remove square brackets
> testa <- strsplit(testa, ";", fixed = TRUE) #split addresses
> testa <- sapply(testa, function(x) gsub("^.*, ([A-Za-z ]*)$", "\\1", x)) #keep only what's after last comma
> testa <- gsub("^ | $", "", testa) #remove spaces
> testa
     [,1]
[1,] "Uzbekistan"
[2,] "Serbia"
[3,] "Australia"
[4,] "Poland"
[5,] "Germany"
[6,] "Univ Sarajevo, Fac Sci, Sarajevo 71000, Bosnia & Herceg"
Unfortunately, this does not work for the last address. I would like to get the following output:
> testa
     [,1]
[1,] "Uzbekistan"
[2,] "Serbia"
[3,] "Australia"
[4,] "Poland"
[5,] "Germany"
[6,] "Bosnia & Herceg"
My question is:
Answer 0 (score: 4)
Why not work backwards?
testa <- gsub("\\[.*?\\] ", "", test)
testa <- strsplit(testa, ";", fixed = TRUE)
# Remaining steps in question are unnecessary with the solution below
> sub(".+, ([A-Za-z& ]+)$","\\1",testa[[1]])
[1] "Uzbekistan" "Serbia" "Australia" "Poland" "Germany" "Bosnia & Herceg"
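Answer 1 (score: 2)
The problem with your code is that the "everything after the last comma" part uses [A-Za-z ] as the only valid characters after that comma. This set does not include &, so no substitution is performed on the last address. Perhaps you should use [^,] instead, which means "anything except a comma".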
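As a minimal sketch of that suggestion (not the asker's exact code): the snippet below swaps [A-Za-z ] for [^,] in the capture group. The shortened input test_short and the variable names addresses and countries are assumptions made for illustration, and trimws() replaces the manual space-stripping step.

```r
# Shortened, hypothetical input: two of the addresses from the question
test_short <- "[Rowell, G.] Univ Adelaide, Sch Chem & Phys, Adelaide, SA 5005, Australia; [Milosevic, D. B.] Univ Sarajevo, Fac Sci, Sarajevo 71000, Bosnia & Herceg"

# Drop the bracketed author lists, then split into one address per element
addresses <- strsplit(gsub("\\[.*?\\] ", "", test_short), ";", fixed = TRUE)[[1]]

# Keep whatever follows the last comma; [^,] also accepts "&"
countries <- trimws(sub("^.*, ([^,]*)$", "\\1", addresses))
countries
```

Because the leading .* is greedy, the capture group can only match the text after the final comma, whatever characters it contains.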
Answer 2 (score: 1)
There are already some better answers here, but I had already worked this out, so I figured I would post it anyway:
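# Split at each opening bracket so every piece holds one address
y <- unlist(strsplit(test, "\\["))
y <- y[y!=""]  # drop the empty piece before the first bracket
# Split each piece on commas and keep only its last field
z <- sapply(y, function(x) strsplit(x, ","))
lens <- sapply(z, length)
a <- sapply(seq_along(z), function(i) z[[i]][lens[i]])
a <- gsub(";", "", a)  # remove the trailing address separator
# Strip leading and trailing whitespace
Trim <- function (x) gsub("^\\s+|\\s+$", "", x)
Trim(a)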
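Answer 3 (score: 1)
Here is a one-liner using strapplyc from the gsubfn package (strapply would work too, but strapplyc is faster). First append ";" to test, then search for [ (using the regular expression "\\["), followed by a string of any characters other than [ (using "[^[]+"), followed by a comma and a space (", "), followed by any sequence of characters other than a comma, semicolon, or [ (using "([^,;[]+)"), followed by a semicolon (;), and return only the parenthesized part: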
> library(gsubfn)
> strapplyc(paste0(test, ";"), "\\[[^[]+, ([^,;[]+);", simplify = TRUE)
     [,1]
[1,] "Uzbekistan"
[2,] "Serbia"
[3,] "Australia"
[4,] "Poland"
[5,] "Germany"
[6,] "Bosnia & Herceg"
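If installing gsubfn is not an option, the same pattern can probably be applied with base R alone. The following is only a sketch under that assumption, not part of the answer above: it substitutes gregexpr(), regmatches(), and sub() for strapplyc, and the shortened input test_short and the names padded, pattern, matches, and countries are hypothetical.

```r
# Shortened, hypothetical input: two of the addresses from the question
test_short <- "[Rowell, G.] Univ Adelaide, Sch Chem & Phys, Adelaide, SA 5005, Australia; [Milosevic, D. B.] Univ Sarajevo, Fac Sci, Sarajevo 71000, Bosnia & Herceg"

# Append ";" so that the last address also ends with a separator
padded <- paste0(test_short, ";")
pattern <- "\\[[^[]+, ([^,;[]+);"

# Extract every full match, then reduce each match to its capture group
matches <- regmatches(padded, gregexpr(pattern, padded, perl = TRUE))[[1]]
countries <- sub(pattern, "\\1", matches, perl = TRUE)
countries
```

This takes two passes (find the matches, then pull out the group) where strapplyc takes one, which is the main convenience the package offers here.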