Finding sequences of elements in a vector

Time: 2019-05-16 09:49:20

Tags: r string

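I need some pointers on this. I don't necessarily need a full-fledged solution here; pointers to functions and/or packages would already be very helpful.

The problem: I want to find specific sequences in a character vector. The sequences may be somewhat "underspecified". That is, some elements are fixed, but for others it does not matter how many there are or what exactly they are.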

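An example: suppose I want to find the following pattern in a character vector:

  1. The sequence should start with "Out of" or "out of"
  2. The sequence should end with "reasons"
  3. In between, there may be other elements. It does not matter how many there are (zero is fine) or what exactly they are.
  4. Between 1. and 2., there should be no ".", "!" or "?".
  5. There should be a parameter that controls how long the stretch in 3. may be at most for a result to still be returned.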

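The function should return the intervening elements of the vector and/or their indices.

So the function should "behave" like this:

  • c("Out", "of", "specific", "reasons", ".") returns "specific"
  • c("Out", "of", "very", "specific", "reasons", ".") returns c("very", "specific")
  • c("out", "of", "curiosity", ".", "He", "had", "his", "reasons") returns "", NA or NULL; which one does not matter, it just signals that there is no result.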

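As I said: I don't need a complete solution. Any pointers to packages that already implement this kind of functionality are much appreciated!

Ideally, I would rather not rely on a solution that first pastes the text together and then matches it with a regular expression.

Thanks a lot!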

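2 answers:

Answer 0: (score: 1)

I'd be really curious to hear of a package that meets your needs. My inclination would be to collapse the strings and use regular expressions, or to find a programmer who uses Perl. Still, here is an extensible solution in R, with a few more cases to experiment with. It is not very elegant, but see whether it is of any use.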

# Recreate data as a list with a few more edge cases
  txt1 <- c(
    "Out of specific reasons.",
    "Out of very specific reasons.",
    "Out of curiosity. He had his reasons.",
    "Out of reasons.",
    "Out of one's mind.",
    "For no particular reason.",
    "Reasons are out of the ordinary.",
    "Out of time and money and for many good reasons, it seems.", 
    "Out of a box, a car, and for random reasons.",
    "Floop foo bar.")
  txt2 <- strsplit(txt1, "[[:space:]]+") # split each line on whitespace
  txt3 <- lapply(txt2, strsplit, "(?=[[:punct:]])", perl = TRUE) # split punctuation off each word
  txt <- lapply(txt3, unlist) # flatten to one token vector per line

# Define characters to exclude: [. ! and ?] but not [,]
  exclude <- "[.!?]"

# Assign acceptable limit to separation
  lim <- 5 # try 7 and 12 to experiment

# Create indices identifying each of the enumerated conditions
  fun1 <- function(x, pat) grep(pat, x, ignore.case = TRUE)
  index1 <- lapply(txt, fun1, "^out$") # anchored, so tokens such as "without" do not match
  index2 <- lapply(txt, fun1, "^of$")
  index3 <- lapply(txt, fun1, "^reasons$")
  index4 <- lapply(txt, fun1, exclude)

# Create logical vectors from indices satisfying the conditions
  fun2 <- function(set, val) val[1] %in% set
  cond1 <- sapply(index1, fun2, val = 1) & sapply(index2, fun2, val = 2)
  cond2 <- sapply(index3, "[", 1) < lim + 2 + 2 # at most lim tokens between 'of' (token 2) and the first 'reasons'; NA if absent
  cond3 <- sapply(index3, max, -Inf) < sapply(index4, min, Inf) # 'reasons' comes before any . ! or ?

# Combine logical vectors to a single logical vector
  valid <- cond1 & cond2 & cond3
  valid <- ifelse(is.na(valid), FALSE, valid)

# Examine selected original lines
  print(txt1[valid])

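# Helper function returning, per line, the indices strictly between 'of' and 'reasons'
  fun3 <- function(index2, index3, valid) {
    found <- rep(list(NULL), length(index2))
    # use the first match of each so that seq() always gets scalar endpoints
    found[valid] <- Map(function(from, to) seq(from[1], to[1]),
                        index2[valid], index3[valid])
    found <- lapply(found, tail, -1) # drop the index of 'of'
    lapply(found, head, -1)          # drop the index of 'reasons'
  }

# Indices of the intervening tokens for each valid line (NULL otherwise)
  idx <- fun3(index2, index3, valid)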

# Return the results or "" for no intervening text or NULL for no match
  ans <- Map(function(x, i) {
    if (is.null(i)) NULL # no match found
    else if (length(i) == 0) "" # no intervening elements
    else x[i]}, # all intervening elements <= lim
  txt, idx)

# Show found (non-NULL) values
  ans[!sapply(ans, is.null)]
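
For comparison, the collapse-and-regex route mentioned at the top of this answer could look roughly like the sketch below. It is not what the question prefers, and `between_regex` and `max_len` are names made up here for illustration. The sketch assumes the tokens can be joined with single spaces and that punctuation has already been split into its own tokens.

```r
# Sketch only: `between_regex` and `max_len` are illustrative names.
between_regex <- function(tokens, max_len = 5) {
  s <- paste(tokens, collapse = " ")
  # "out of", then at most max_len tokens containing no . ! ?, then "reasons";
  # the lazy quantifier stops at the nearest "reasons"
  pat <- sprintf("\\bout of((?: [^ .!?]+){0,%d}?) reasons\\b", max_len)
  m <- regmatches(s, regexec(pat, s, ignore.case = TRUE, perl = TRUE))[[1]]
  if (length(m) == 0) return(NULL)   # no match at all
  mid <- trimws(m[2])                # captured tokens between the anchors
  if (mid == "") character(0) else strsplit(mid, " ", fixed = TRUE)[[1]]
}

between_regex(c("Out", "of", "very", "specific", "reasons", "."))
```

Here `NULL` means no match and `character(0)` means "reasons" follows "of" directly. Only the first occurrence in the string is reported.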

Answer 1: (score: 0)

So let's take your example

x <- c("Out", "of", "very", "specific", "reasons", ".")

First we need the index of the beginning

i_Beginning <- as.numeric(grep("Out|out", x))

and of the end

i_end <-  as.numeric(grep("reasons", x))

We also need to check whether "Out" is followed by "of"

Is_Of <- grepl("Of|of", x[i_Beginning +1])

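If so, we extract the elements in between

extraction <- NULL # stays NULL when nothing matches
if (isTRUE(Is_Of) && isTRUE(i_end > i_Beginning + 1)) {
  # every index strictly between "of" and "reasons" (none if they are adjacent)
  extraction <- x[seq_len(i_end - i_Beginning - 2) + i_Beginning + 1]
}
print(extraction)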
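
The steps above handle a single match and do not yet check for sentence-ending punctuation or a maximum length. A minimal sketch of how they might be wrapped into one function follows. `find_between`, `max_len` and `stop_at` are hypothetical names, and the tokens are assumed to be split already, as in `x`.

```r
# Sketch only: `find_between`, `max_len` and `stop_at` are hypothetical names.
find_between <- function(x, max_len = 5, stop_at = c(".", "!", "?")) {
  lx <- tolower(x)
  # candidate starts: "out" immediately followed by "of"
  starts <- which(lx == "out" & c(lx[-1], NA) == "of")
  res <- list()
  for (s in starts) {
    rest <- seq_len(length(x) - s - 1L) + s + 1L  # positions after "of"
    # first token after "of" that is either "reasons" or a stop token
    hit <- rest[lx[rest] == "reasons" | x[rest] %in% stop_at][1]
    n_between <- hit - s - 2L
    if (!is.na(hit) && lx[hit] == "reasons" && n_between <= max_len) {
      idx <- seq_len(n_between) + s + 1L          # strictly between the anchors
      res[[length(res) + 1L]] <- list(index = idx, tokens = x[idx])
    }
  }
  res  # empty list when nothing matches
}

find_between(c("Out", "of", "very", "specific", "reasons", "."))
```

Each match is returned as a list with the indices and the tokens, so both of the return values asked for in the question are available. Working on the token vector directly avoids pasting the text back together.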