Question

I am trying to extract all the words contained in 'tokens' in the string below, but only where 'tokens' occurs after 'tag(noun)'.

For example, I have the string:

m<- "phrase('The New York State Department',[det([lexmatch(['THE']),
inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),
inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),
head([lexmatch([department]),inputmatch(['Department']),tag(noun),
tokens([department])])],0/29,[])."

I would like to get a list of all the words inside the brackets that follow 'tokens', but only when 'tokens' comes right after 'tag(noun)'.

So I would like my output to be the following vector:

[1] new, york, state, department

How do I do this? I assume I have to use a regular expression, but I am lost on how to write this in R.

Thanks!

Answer 1

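Remove the newlines, then extract the part of each match that corresponds to the parenthesized portion of the pattern pat. Then split those strings on commas and simplify the result to a character vector: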

library(gsubfn)

pat <- "tag.noun.,tokens..(.*?)\\]"
strapply(gsub("\\n", "", m), pat, ~ unlist(strsplit(x, ",")), simplify = c)

giving:

[1] "new"        "york"       "state"      "department"

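Visualization: here is the regular expression in pat as it is entered in debuggex. (Note that the backslash has to be doubled when the pattern is placed inside R's double quotes):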

 tag.noun.,tokens..(.*?)\]


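Note that .*? matches the shortest string of any characters that still lets the whole pattern match. Without the ? it would try to match the longest such string.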
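To make the shortest-versus-longest distinction concrete, here is a minimal base R sketch. The string s is a shortened stand-in for m, and the names lazy, greedy, lazy_match and greedy_match are chosen only for this illustration; perl = TRUE is used so that the lazy quantifier is handled by the PCRE engine.

```r
# Shortened stand-in for the string m from the question (an assumption for brevity)
s <- "tag(noun),tokens([new,york,state])]),head([tag(noun),tokens([department])])"

lazy   <- "tokens..(.*?)\\]"   # stop at the first closing bracket
greedy <- "tokens..(.*)\\]"    # run on to the last closing bracket

# regexpr() locates the first match; regmatches() extracts its text
lazy_match   <- regmatches(s, regexpr(lazy, s, perl = TRUE))
greedy_match <- regmatches(s, regexpr(greedy, s, perl = TRUE))

lazy_match
# [1] "tokens([new,york,state]"

# The greedy match starts at the same place but swallows everything
# up to the last "]" in the string, including the second tokens([...]) group
nchar(greedy_match) > nchar(lazy_match)
# [1] TRUE
```

With the greedy version the two noun groups would be merged into one match, which is why the answer relies on .*? here.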

Answer 2

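How about something like this? Here I will use the regcapturedmatches helper function (not part of base R, so it has to be defined or sourced separately) to make it easier to extract the captured matches.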

m<- "phrase('The New York State Department',[det([lexmatch(['THE']),inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),head([lexmatch([department]),inputmatch(['Department']),tag(noun),tokens([department])])],0/29,[])."

rx <- gregexpr("tag\\(noun\\),tokens\\(\\[([^]]+)\\]\\)", m, perl=TRUE)
lapply(regcapturedmatches(m,rx), function(x) {
    unlist(strsplit(c(x),","))
})

# [[1]]
# [1] "new"        "york"       "state"      "department"

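The regular expression is a bit messy because the text you want to match contains many characters that are special in regular expressions, so we need to escape them properly.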
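If the helper function is not available, the same escaped pattern can be used with base R alone. The following is only a sketch of one possible approach, not the answer author's method: it first pulls out each full match and then uses sub() with a backreference to keep just the captured group. The names pattern, full, inner and tokens are illustrative.

```r
m <- "phrase('The New York State Department',[det([lexmatch(['THE']),inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),head([lexmatch([department]),inputmatch(['Department']),tag(noun),tokens([department])])],0/29,[])."

# Same pattern as above: literal ( ) [ ] are escaped, and the capture
# group ([^]]+) takes everything up to the next closing bracket
pattern <- "tag\\(noun\\),tokens\\(\\[([^]]+)\\]\\)"

# Step 1: extract every full match, e.g. "tag(noun),tokens([department])"
full <- regmatches(m, gregexpr(pattern, m, perl = TRUE))[[1]]

# Step 2: replace each full match by its first capture group
inner <- sub(pattern, "\\1", full, perl = TRUE)

# Step 3: split on literal commas and flatten to one character vector
tokens <- unlist(strsplit(inner, ",", fixed = TRUE))
tokens
# [1] "new"        "york"       "state"      "department"
```

Applying the pattern twice is slightly redundant, but it avoids any dependency outside base R.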

Answer 3

If you prefer, you can use the following:

paste(unlist(regmatches(m, gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl=TRUE))), collapse=",")
[1] "new,york,state,department"

Breakdown:

# Get match indices
indices <- gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl=TRUE)

# Extract the matches
matches <- regmatches(m, indices)

# unlist and paste together
paste(unlist(matches), collapse=",")
[1] "new,york,state,department"
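The result above is a single comma-separated string, whereas the question asks for a vector of words. As a hedged variation on the same lookbehind pattern, the matches can be split instead of pasted; the names lookbehind and token_vec are made up for this sketch.

```r
m <- "phrase('The New York State Department',[det([lexmatch(['THE']),inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),head([lexmatch([department]),inputmatch(['Department']),tag(noun),tokens([department])])],0/29,[])."

# Lookbehind: match only text that is preceded by "tag(noun),tokens(["
lookbehind <- "(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*"

# One element per noun group: "new,york,state" and "department"
matches <- regmatches(m, gregexpr(lookbehind, m, perl = TRUE))[[1]]

# Split each element on literal commas and flatten
token_vec <- unlist(strsplit(matches, ",", fixed = TRUE))
token_vec
# [1] "new"        "york"       "state"      "department"
```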

Extracting a subset of a string after specific text in R