Question

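Hi everyone. I'm completely new to regex in R, and I'm having trouble retrieving a set of smaller patterns from the middle of a larger pattern in a tagged XML file.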

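Here I have a three-word sequence, "reinforce the advantages", tagged with the BNC (British National Corpus) Basic (C5) tagset. Specifically, I only want to retrieve the three lemmatized words that immediately follow each "hw=" in this long sequence.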

<w c5=VVI hw=reinforce pos=VERB>reinforce </w><w c5=AT0 hw=the pos=ART>the </w><w c5=NN2 hw=advantage pos=SUBST>advantages </w>

Could someone provide a solution using gsub or another function in R? Many thanks in advance!

NF

Answer 1

vec <- "<w c5=VVI hw=reinforce pos=VERB>reinforce </w><w c5=AT0 hw=the pos=ART>the </w><w c5=NN2 hw=advantage pos=SUBST>advantages </w>"

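# Match every run of non-whitespace characters that directly follows "hw=".
# The lookbehind (?<=...) is PCRE syntax, so perl = TRUE is required.
m <- gregexpr("(?<=hw=)\\S+", vec, perl = TRUE)
regmatches(vec, m)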

# [[1]]
# [1] "reinforce" "the"       "advantage"
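The result above is a list because gregexpr() and regmatches() work on whole character vectors and return one element per input string. A minimal sketch of that behaviour follows; the split into two fragments and the names vecs, res and per_string are made up for illustration.

```r
# Two tagged fragments (an illustrative split of the original sequence).
vecs <- c(
  "<w c5=VVI hw=reinforce pos=VERB>reinforce </w><w c5=AT0 hw=the pos=ART>the </w>",
  "<w c5=NN2 hw=advantage pos=SUBST>advantages </w>"
)

m <- gregexpr("(?<=hw=)\\S+", vecs, perl = TRUE)
res <- regmatches(vecs, m)

# One list element per input string, each holding that string's matches.
str(res)

# Collapse the matches of each input string into a single string.
per_string <- vapply(res, paste, character(1), collapse = " ")
print(per_string)
```

Keeping the list structure is convenient when each input string is a separate sentence or corpus line and the lemmas should not be mixed together.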

Explanation copied from regex101.com:

/(?<=hw=)\S+/

Positive Lookbehind (?<=hw=)

Assert that the Regex below matches
hw= matches the characters hw= literally (case sensitive)

\S matches any single non-whitespace character (equivalent to [^\r\n\t\f\v ])

+ Quantifier: matches the preceding token between one and unlimited times, as many times as possible,
giving back as needed (greedy)
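To see what the lookbehind contributes, the sketch below compares the pattern with and without it. It also shows one caveat: \S+ only stops at whitespace, so it relies on another attribute following hw=. The last_attr string is a hypothetical tag layout (not taken from the question) where hw= comes last, and the [^\s>]+ variant is just one possible way to guard against that.

```r
vec <- "<w c5=VVI hw=reinforce pos=VERB>reinforce </w><w c5=AT0 hw=the pos=ART>the </w>"

# Without the lookbehind, "hw=" is part of every match.
with_prefix <- regmatches(vec, gregexpr("hw=\\S+", vec, perl = TRUE))[[1]]

# With the lookbehind, "hw=" must precede the match but is not included in it.
lemma_only <- regmatches(vec, gregexpr("(?<=hw=)\\S+", vec, perl = TRUE))[[1]]

print(with_prefix)  # "hw=reinforce" "hw=the"
print(lemma_only)   # "reinforce" "the"

# Hypothetical input where hw= is the last attribute before ">".
last_attr <- "<w c5=AT0 hw=the>the </w>"

# \S+ runs on past ">" until the next whitespace character.
greedy <- regmatches(last_attr, gregexpr("(?<=hw=)\\S+", last_attr, perl = TRUE))[[1]]

# Excluding ">" as well as whitespace stops the match at the end of the attribute.
guarded <- regmatches(last_attr, gregexpr("(?<=hw=)[^\\s>]+", last_attr, perl = TRUE))[[1]]

print(greedy)   # "the>the"
print(guarded)  # "the"
```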

First unlist the result (see ?unlist), then collapse it into a single string (see ?paste0):

paste0(unlist(
    regmatches(vec, m)
), collapse = " ")

# [1] "reinforce the advantage"
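Since the question asked about gsub specifically, here is a possible gsub-based sketch of the same result. It assumes every element has the shape <w ... hw=VALUE ...>text </w> with at least one attribute after hw=, as in the sample; the pattern is an illustration, not a general XML parser.

```r
vec <- "<w c5=VVI hw=reinforce pos=VERB>reinforce </w><w c5=AT0 hw=the pos=ART>the </w><w c5=NN2 hw=advantage pos=SUBST>advantages </w>"

# Replace each whole <w ...>...</w> element with its captured hw value
# followed by a space, then trim the trailing space.
lemmas <- gsub("<w [^>]*hw=(\\S+)[^>]*>[^<]*</w>", "\\1 ", vec, perl = TRUE)
lemmas <- trimws(lemmas)

print(lemmas)  # "reinforce the advantage"
```

The gregexpr()/regmatches() approach is usually easier to adapt, because it extracts the matches directly instead of deleting everything around them.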

Regex pattern with gsub in R: getting small patterns from the middle of a large pattern in an XML file
