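I often need to return part of a text string (for example, the middle field of a text where the fields are separated by "." and there is text on both sides). I have ended up using some basic code that 1. splits the string with strsplit(), 2. unlist()s the string components, 3. builds a matrix with as many rows as there are string sub-elements, and 4. subsets the rows I need. There must be a better way, right? Although it would be more direct, I often cannot use substr(), because the length of the string components is not constant across the vector.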
#make data
set.seed(1)
n <- 50
let1 <- LETTERS[runif(n, min=1, max=26)]
num <- round(runif(100, min=1, max=100))
let2 <- c(LETTERS[runif(n, min=1, max=26)], LETTERS[runif(n, min=1, max=26)])
tmpstr <- paste(let1, num, let2, sep=".")
tmpstr
#resulting string
> tmpstr
[1] "G.48.P" "J.86.N" "O.44.I" "W.25.L" "F.8.M" "W.11.E" "X.32.N"
[8] "Q.52.B" "P.67.G" "B.41.F" "F.91.H" "E.30.W" "R.46.L" "J.34.T"
[15] "T.65.W" "M.27.K" "R.48.B" "Y.77.I" "J.9.S" "T.88.I" "X.35.P"
[22] "F.84.V" "Q.35.V" "D.34.J" "G.48.J" "J.89.W" "A.87.Q" "J.40.S"
[29] "V.78.P" "I.96.W" "M.44.H" "O.72.E" "M.41.W" "E.33.M" "U.76.V"
[36] "Q.21.E" "T.71.S" "C.13.S" "S.25.X" "K.15.N" "U.25.R" "Q.7.J"
[43] "T.65.C" "N.88.X" "N.78.H" "T.80.O" "A.46.C" "L.42.V" "S.81.H"
[50] "R.61.T" "G.66.G" "J.36.F" "O.28.M" "W.99.G" "F.64.E" "W.22.M"
[57] "X.14.O" "Q.48.D" "P.92.G" "B.60.R" "F.98.Y" "E.73.C" "R.36.T"
[64] "J.44.X" "T.16.U" "M.2.H" "R.72.Q" "Y.11.X" "J.45.X" "T.64.I"
[71] "X.99.G" "F.50.E" "Q.49.I" "D.18.M" "G.76.X" "J.46.M" "A.52.G"
[78] "J.22.B" "V.24.K" "I.60.V" "M.58.I" "O.9.D" "M.5.J" "E.65.P"
[85] "U.93.J" "Q.60.R" "T.57.R" "C.53.N" "S.99.K" "K.51.L" "U.69.H"
[92] "Q.61.O" "T.25.W" "N.27.D" "N.73.K" "T.46.F" "A.18.K" "L.75.D"
[99] "S.11.L" "R.87.X"
#possible substring extraction (e.g. the numbers in between the letters)
matrix(unlist(strsplit(tmpstr, ".", fixed = TRUE)), nrow=3)[2,] #version 1
unlist(lapply(as.list(tmpstr), FUN=function(x) strsplit(x, ".", fixed=TRUE)[[1]][2])) #version 2 - not much shorter
#desired result
[1] "48" "86" "44" "25" "8" "11" "32" "52" "67" "41" "91" "30" "46"
[14] "34" "65" "27" "48" "77" "9" "88" "35" "84" "35" "34" "48" "89"
[27] "87" "40" "78" "96" "44" "72" "41" "33" "76" "21" "71" "13" "25"
[40] "15" "25" "7" "65" "88" "78" "80" "46" "42" "81" "61" "66" "36"
[53] "28" "99" "64" "22" "14" "48" "92" "60" "98" "73" "36" "44" "16"
[66] "2" "72" "11" "45" "64" "99" "50" "49" "18" "76" "46" "52" "22"
[79] "24" "60" "58" "9" "5" "65" "93" "60" "57" "53" "99" "51" "69"
[92] "61" "25" "27" "73" "46" "18" "75" "11" "87"
Answer 0 (score: 3)
You can use gsub() with a capture group to return whatever lies between the two dots:
gsub('.*[.](.*)[.].*','\\1',tmpstr)
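To make the capture-group idea easier to follow, here is a minimal sketch on a small made-up vector (x and mid are hypothetical names, not from the question). It uses a slightly stricter pattern than the one above: the pattern above relies on a greedy .*, so on a string with more than two dots it would, as far as I can tell, return the second-to-last field rather than the second one. Anchoring on "anything but a dot" avoids that.

```r
# Hypothetical input; the middle field has a varying length
x <- c("G.48.P", "F.8.M", "AB.123.CD")

# sub() is enough here because each string is matched once as a whole.
#   ^[^.]*    first field (anything except a dot)
#   [.]       the first literal dot
#   ([^.]*)   the middle field, captured as group 1
#   [.].*$    the second dot and everything after it
mid <- sub("^[^.]*[.]([^.]*)[.].*$", "\\1", x)
mid
```

Strings that do not match the pattern at all are returned unchanged by sub(), so it may be worth checking the input first if that can happen in your data.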
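Answer 1 (score: 1)
I believe you don't need to do anything complicated: you can use the function str_extract() from the awesome package stringr.
library(stringr)
str_extract(tmpstr, "[0-9]+")
This uses a regular expression to extract only the first run of digits from each string. Obviously, your real data may make this more complicated, but it should give you a good starting point.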
Edit: extraction specifically between two full stops, with the full stops removed (my regex skills are poor, so I had to do it this way)
str_replace_all(str_extract(tmpstr,"[.][[:alnum:]]*[.]"),"\\.","")
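If installing stringr is not an option, the digit extraction can be approximated in base R. This is only a sketch under my own assumptions: x and digits are made-up names, and the NA handling is something I added to mimic how str_extract() reports a non-match. Note that regmatches() on its own silently drops elements without a match, so the result could otherwise be shorter than the input.

```r
# Hypothetical input, including one string without any digits
x <- c("G.48.P", "F.8.M", "no-digits", "J.9.S")

# regexpr() locates the first run of digits in each string (-1 if none)
m <- regexpr("[0-9]+", x)

# Start with NA everywhere, then fill in the strings that did match
digits <- rep(NA_character_, length(x))
digits[m > 0] <- regmatches(x, m)
digits
```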
Answer 2 (score: 1)
Here are two possible solutions:
library(qdap)
unname(unlist(genXtract(tmpstr, ".", ".")))
do.call(rbind, strsplit(tmpstr, "\\."))[, 2]
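The rbind() and matrix() approaches assume every string has the same number of fields. A variation that, I think, degrades more gracefully is to index each split result individually. The sketch below uses made-up data (x, parts and second are hypothetical names) and shows that a string with too few fields yields NA instead of shifting the other values.

```r
# Hypothetical input; the last string has only one field
x <- c("G.48.P", "F.8.M", "A")

# Split on a literal dot (fixed = TRUE avoids regex escaping)
parts <- strsplit(x, ".", fixed = TRUE)

# Take the second piece of each split; vapply() guarantees a character vector
second <- vapply(parts, function(p) p[2], character(1))
second
```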
Answer 3 (score: 1)
Here are some alternatives. I expect the gsub() solutions to be the fastest:
1. Remove all non-digits, leaving only the digits:
gsub("\\D", "", tmpstr)
2. Instead of removing the non-digits, we can pull out the digits using strapplyc():
library(gsubfn)
strapplyc(tmpstr, "\\d+", simplify = TRUE)
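3. If the middle field is not necessarily just digits, then we can do this instead, which removes everything up to and including the first dot, as well as the last dot and everything after it: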
gsub("^.*?[.]|[.].*?$", "", tmpstr)
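4. We can also use read.table(), in which case we don't need any regular expressions at all (note that the result is then numeric rather than character):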
read.table(text = tmpstr, sep = ".", as.is = TRUE)[[2]]
5. For an approach along the lines of strsplit(), try this:
simplify2array(strsplit(tmpstr, ".", fixed = TRUE))[2,]
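Whether gsub() really is the fastest will depend on your data and R version, so here is a rough way to check it yourself with base R only. Everything in this sketch is an assumption for illustration: the vector big, its size, and the choice of three of the methods above. system.time() is coarse, so treat the numbers as a hint rather than a benchmark.

```r
# Build a larger hypothetical vector in the same "letter.number.letter" shape
set.seed(1)
n <- 20000
big <- paste(sample(LETTERS, n, replace = TRUE),
             sample(1:100, n, replace = TRUE),
             sample(LETTERS, n, replace = TRUE), sep = ".")

# Time three of the approaches and keep their results for comparison
t_gsub  <- system.time(r1 <- gsub("\\D", "", big))
t_split <- system.time(r2 <- simplify2array(strsplit(big, ".", fixed = TRUE))[2, ])
t_read  <- system.time(r3 <- read.table(text = big, sep = ".", as.is = TRUE)[[2]])

# The character results should agree; read.table() returns integers instead
stopifnot(identical(r1, r2), identical(as.integer(r1), r3))

# Elapsed seconds per method
c(gsub = t_gsub[["elapsed"]],
  strsplit = t_split[["elapsed"]],
  read.table = t_read[["elapsed"]])
```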