How to extract text that has been inserted into a string in R

Time: 2018-07-03 13:30:05

Tags: r string substring extract

I have a string with a known format, for example:

"This string will have additional text here *, and it will have more here ^, and finally there will be more here ~ with some text after."

Then there will be a piece of data such as

"This string will have additional text here about things, and it will have more here regarding other stuff, and finally there will be more here near the end with some text after."

where the inserted text will not always have the same length. I need a way to identify what each of *, ^ and ~ corresponds to in the second string:

* = "about things"
^ = "regarding other stuff"
~ = "near the end"

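The new string will not contain any delimiters around the inserted text, but the template string should hopefully have enough unique text between the variable parts that each one can be identified every time.

I have tried looking around but could not find anything similar to what I am asking. Any package or function would be very helpful!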

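3 answers:

Answer 0: (score: 1)

I don't know if this is the best solution, but I would replace the known parts with a delimiter (or with nothing at the beginning and at the end) and then split the resulting text on that delimiter.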

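text <- "This string will have additional text here about things, and it will have more here regarding other stuff, and finally there will be more here near the end with some text after."
# Strip the known leading text; fixed = TRUE matches the pattern literally
temp <- gsub("This string will have additional text here ", "", text, fixed = TRUE)
# Replace the known middle parts with the delimiter "^"
temp <- gsub(", and it will have more here ", "^", temp, fixed = TRUE)
temp <- gsub(", and finally there will be more here ", "^", temp, fixed = TRUE)
# Strip the known trailing text (as a regex, the "." would match any character)
temp <- gsub(" with some text after.", "", temp, fixed = TRUE)
# Split on the delimiter to get the inserted parts
solution <- unlist(strsplit(temp, "^", fixed = TRUE))
solution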
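
The same idea can be driven directly by the template string instead of hard-coding each known part. The following is only a sketch under some assumptions: `extract_inserts` is a hypothetical helper name, the placeholders are assumed to be exactly the characters *, ^ and ~, and the literal parts of the template are assumed not to contain the sequence `\E` (they are quoted with `\Q...\E`, which requires `perl = TRUE`).

```r
template <- "This string will have additional text here *, and it will have more here ^, and finally there will be more here ~ with some text after."
text <- "This string will have additional text here about things, and it will have more here regarding other stuff, and finally there will be more here near the end with some text after."

# Hypothetical helper: build one regex from the template and capture each insert
extract_inserts <- function(template, text) {
  # Placeholders in their order of appearance: "*", "^", "~"
  keys <- regmatches(template, gregexpr("[*^~]", template))[[1]]
  # Literal pieces of the template between the placeholders
  pieces <- strsplit(template, "[*^~]")[[1]]
  # Quote every literal piece and join the pieces with lazy capture groups
  pattern <- paste0("^", paste0("\\Q", pieces, "\\E", collapse = "(.*?)"), "$")
  m <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]
  # m[1] is the whole match; the remaining elements are the captured inserts
  setNames(m[-1], keys)
}

result <- extract_inserts(template, text)
result
```

The result is a named character vector, so `result[["^"]]` gives the text that replaced the caret. If the data string does not follow the template, `regexec()` finds no match and the returned vector is empty.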

Answer 1: (score: 1)

A slight variation on @Benjamin Schlegel's answer using the stringr package, which keeps the known parts and their replacements (visually) closer together.

library(stringr)

text <- "This string will have additional text here about things, and it will have more here regarding other stuff, and finally there will be more here near the end with some text after."

text_repl <-
  str_replace_all(
    text,
    c(
      "This string will have additional text here " = "",
      ", and it will have more here "               = "^",
      ", and finally there will be more here "      = "^",
      " with some text after."                      = ""
    )
  )

str_split(text_repl, "\\^", simplify = TRUE)
#>      [,1]           [,2]                    [,3]          
#> [1,] "about things" "regarding other stuff" "near the end"

str_split() returns a list of character vectors (simplify = FALSE) or a character matrix (simplify = TRUE), which can easily be converted to a data.frame.
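
As a rough illustration of that conversion, here is a minimal sketch. It assumes two strings that have already been reduced to their inserted parts separated by "^"; the column names `star`, `caret` and `tilde` are made up for this example.

```r
library(stringr)

# Two already-reduced strings in which "^" separates the inserted parts
text_repl <- c(
  "about things^regarding other stuff^near the end",
  "on this topic^to follow up^to finish"
)

# One row per input string, one column per inserted part
parts <- str_split(text_repl, "\\^", simplify = TRUE)

df <- as.data.frame(parts, stringsAsFactors = FALSE)
names(df) <- c("star", "caret", "tilde")  # hypothetical column names
df
```

Each row then corresponds to one input string, which makes it straightforward to process many data strings at once.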

Answer 2: (score: 1)

Perhaps you could look for the unique patterns of words immediately before and after ~, * and ^, and put them in vectors like this:

priorstrings <- c("text here", "have more here", "be more here")
afterstrings <- c("and it", "and finally", "with some")  

Then check that these are really unique:

length(unique(priorstrings)) == length(priorstrings)
length(unique(afterstrings)) == length(afterstrings)

Both are TRUE.
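
Note that this check only confirms that the entries differ from each other. A complementary check, sketched here with a hypothetical helper `count_occurrences`, is to count how often each entry occurs in the template itself, since an anchor that appears more than once could match in the wrong place.

```r
template <- "This string will have additional text here *, and it will have more here ^, and finally there will be more here ~ with some text after."
priorstrings <- c("text here", "have more here", "be more here")
afterstrings <- c("and it", "and finally", "with some")

# Hypothetical helper: count literal, non-overlapping occurrences of each needle
count_occurrences <- function(needles, haystack) {
  vapply(needles, function(n) {
    hits <- gregexpr(n, haystack, fixed = TRUE)[[1]]
    sum(hits > 0)  # gregexpr() returns -1 when there is no match
  }, integer(1))
}

prior_counts <- count_occurrences(priorstrings, template)
after_counts <- count_occurrences(afterstrings, template)

all(prior_counts == 1)
all(after_counts == 1)
```

For the example template every anchor occurs exactly once, so both checks pass.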

Then paste them together into search patterns, with a capture group in between, like this:

fullsearches <- paste0(priorstrings, " (.*? )" , afterstrings)

I used the example string again, named it y, and added another string named z:

y <- "This string will have additional text here about things, and it will have more here regarding other stuff, and finally there will be more here near the end with some text after."
z <- "This string will have additional text here on this topic, and it will have more here to follow up, and finally there will be more here to finish with some text after."

Then, finally, do the following:

library(stringr)  # provides str_match()
sapply(list(y,z), function(x) str_match(x, fullsearches)[,2])

This gives:

     [,1]                      [,2]             
[1,] "about things, "          "on this topic, "
[2,] "regarding other stuff, " "to follow up, " 
[3,] "near the end "           "to finish "  

I think you can add more prior strings, after strings and full searches in the same way, and apply them to a larger list of strings.
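
The matches above still carry the trailing comma and space, because the after strings do not include the punctuation. One possible clean-up, assuming the inserted text itself never legitimately ends in punctuation, is to strip trailing whitespace and punctuation afterwards. The matrix `res` below simply re-creates the output shown above.

```r
# Re-create the result matrix shown above
res <- cbind(
  c("about things, ", "regarding other stuff, ", "near the end "),
  c("on this topic, ", "to follow up, ", "to finish ")
)

# Remove trailing whitespace and punctuation; res[] keeps the matrix shape
res[] <- sub("[[:space:][:punct:]]+$", "", res)
res
```

An alternative would be to put the punctuation into the after strings (for example ", and it"), so that the capture group never picks it up in the first place.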