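I need some pointers on this. I don't necessarily need a full-fledged solution; some pointers to functions and/or packages would be very helpful.
The problem: I want to find a specific sequence in a character vector. The sequence may be somewhat "underspecified": some elements are fixed, but for others it doesn't matter how many there are or what exactly they are.
An example: suppose I want to find the following pattern in a character vector:
The function should return the in-between elements of the vector and/or their indices.
So the function should "behave" like this: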
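c("Out", "of", "specific", "reasons", ".")
returns "specific"
c("Out", "of", "very", "specific", "reasons", ".")
returns c("very", "specific")
c("out", "of", "curiosity", ".", "He", "had", "his", "reasons")
returns "" or NA or NULL; which one doesn't matter, it only has to signal that there is no result.
As I said: I don't need a complete solution. Any pointer to a package that already implements this kind of functionality would be greatly appreciated!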
Ideally, I don't want to rely on a solution that first pastes the text together and then matches it with regular expressions.
Thanks a lot!
Answer 0 (score: 1)
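I'd really like to know of a package that meets your needs. My inclination would be to collapse the strings and use regular expressions, or find a programmer, or use Perl. Still, here is an extensible solution in R, with a few more cases to experiment with. It's not very elegant, but see whether it is useful.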
# Recreate the data as a list, with a few more edge cases
txt1 <- c(
  "Out of specific reasons.",
  "Out of very specific reasons.",
  "Out of curiosity. He had his reasons.",
  "Out of reasons.",
  "Out of one's mind.",
  "For no particular reason.",
  "Reasons are out of the ordinary.",
  "Out of time and money and for many good reasons, it seems.",
  "Out of a box, a car, and for random reasons.",
  "Floop foo bar.")
txt2 <- strsplit(txt1, "[[:space:]]+")  # split each line on whitespace
# Split punctuation off the words (zero-width lookahead, needs perl = TRUE)
txt3 <- lapply(txt2, strsplit, "(?=[[:punct:]])", perl = TRUE)
txt <- lapply(txt3, unlist)  # one character vector of tokens per line
# Define characters to exclude: [. ! and ?] but not [,]
exclude <- "[.!?]"
# Assign acceptable limit to separation
lim <- 5 # try 7 and 12 to experiment
# Create indices identifying each of the enumerated conditions
# (patterns are anchored so that e.g. "of" does not match "often")
fun1 <- function(x, pat) grep(pat, x, ignore.case = TRUE)
index1 <- lapply(txt, fun1, "^out$")
index2 <- lapply(txt, fun1, "^of$")
index3 <- lapply(txt, fun1, "^reasons$")
index4 <- lapply(txt, fun1, exclude)
# Create logical vectors from indices satisfying the conditions
fun2 <- function(set, val) val[1] %in% set
cond1 <- sapply(index1, fun2, val = 1) & sapply(index2, fun2, val = 2)  # line starts with "Out of"
cond2 <- sapply(index3, "[", 1) < lim + 2 + 2  # "reasons" at most lim tokens after "of" (position 2)
cond3 <- sapply(index3, max, -Inf) < sapply(index4, min, Inf)  # "reasons" precedes any . ! ?
# Combine logical vectors to a single logical vector
valid <- cond1 & cond2 & cond3
valid <- ifelse(is.na(valid), FALSE, valid)
# Examine selected original lines
print(txt1[valid])
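# Helper function: indices strictly between "of" and "reasons"
fun3 <- function(index2, index3, valid) {
  found <- rep(list(NULL), length(index2))
  # seq() needs scalars, so use the first "of" and the first "reasons"
  first <- function(v) v[1]
  found[valid] <- Map(seq, lapply(index2[valid], first), lapply(index3[valid], first))
  found <- lapply(found, tail, -1)  # drop the position of "of"
  lapply(found, head, -1)           # drop the position of "reasons"
}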
# Extract starting and ending element from valid list members
idx <- fun3(index2, index3, valid)
# Return the results, or "" for no intervening text, or NULL for no match
ans <- Map(function(x, i) {
  if (is.null(i)) NULL              # no match found
  else if (length(i) == 0) ""       # no intervening elements
  else x[i]                         # all intervening elements (at most lim)
}, txt, idx)
# Show found (non-NULL) values
ans[!sapply(ans, is.null)]
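For comparison, the collapse-and-regex route mentioned at the top of this answer could look roughly like the sketch below. It assumes the pattern is "out", "of", up to `max_gap` free tokens, then "reasons", that no token contains a space, and that any token containing `.`, `!` or `?` ends the search. The names `find_between_regex` and `max_gap` are made up for illustration and do not come from any package.

```r
# Sketch only: paste the tokens together, match once, split the gap again.
# `find_between_regex` and `max_gap` are hypothetical names.
find_between_regex <- function(tokens, max_gap = 3) {
  # Assumes no token contains a space, so " " is a safe separator
  s <- paste(tokens, collapse = " ")
  # "out of", then 0..max_gap tokens without . ! ?, then "reasons"
  pat <- sprintf("(?:^| )out of ((?:[^ .!?]+ ){0,%d})reasons(?: |$)", max_gap)
  m <- regmatches(s, regexec(pat, s, ignore.case = TRUE, perl = TRUE))[[1]]
  if (length(m) == 0) return(NULL)          # no match at all
  gap <- trimws(m[2])                       # captured in-between text
  if (!nzchar(gap)) return(character(0))    # matched, nothing in between
  strsplit(gap, " ", fixed = TRUE)[[1]]
}

find_between_regex(c("Out", "of", "very", "specific", "reasons", "."))
```

This is shorter than the index-based code, but it returns only the elements, not their positions in the original vector, and it breaks down as soon as a token may contain the separator.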
Answer 1 (score: 0)
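So let's assume your example
x <- c("Out", "of", "very", "specific", "reasons", ".")
First we need the index where the sequence begins
i_Beginning <- grep("^out$", x, ignore.case = TRUE)
and where it ends
i_end <- grep("^reasons$", x)
We also need to check whether "Out" is followed by "of"
Is_Of <- grepl("^of$", x[i_Beginning + 1], ignore.case = TRUE)
If so, we extract the elements in between
extraction <- character(0)  # stays empty when the pattern is absent
if (isTRUE(Is_Of)) {
  # every element strictly between "of" and "reasons", not just the two end points
  extraction <- x[i_Beginning + 1 + seq_len(i_end - i_Beginning - 2)]
}
print(extraction)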
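The steps above can be wrapped into one function that works directly on the token vector, without pasting, and returns both the in-between elements and their indices. This is a minimal sketch under stated assumptions: the function name `find_between` and its arguments `start`, `end`, `max_gap` and `stop_pat` are invented for illustration, the gap limit of 3 is arbitrary, and a gap is rejected when it contains a token that is exactly `.`, `!` or `?`.

```r
# Sketch only: `find_between` and all of its arguments are hypothetical names.
find_between <- function(x, start = c("out", "of"), end = "reasons",
                         max_gap = 3, stop_pat = "^[.!?]$") {
  lx <- tolower(x)
  start <- tolower(start)
  end <- tolower(end)
  n_start <- length(start)
  # Try every position where the first fixed element occurs
  for (i in which(lx == start[1])) {
    last_start <- i + n_start - 1
    if (last_start > length(x)) next
    if (!all(lx[i:last_start] == start)) next
    # Candidate end positions that respect the gap limit
    ends <- which(lx == end)
    ends <- ends[ends > last_start & ends - last_start - 1 <= max_gap]
    for (j in ends) {
      gap <- last_start + seq_len(j - last_start - 1)  # may be integer(0)
      # Reject the candidate if a sentence-ending token sits inside the gap
      if (!any(grepl(stop_pat, x[gap]))) {
        return(list(index = gap, tokens = x[gap]))
      }
    }
  }
  NULL  # no match
}

find_between(c("Out", "of", "very", "specific", "reasons", "."))
find_between(c("out", "of", "curiosity", ".", "He", "had", "his", "reasons"))
```

Returning `NULL` for "no match" and an empty `tokens` vector for "matched, but nothing in between" keeps the two cases distinguishable, which the question leaves open.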