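Hi everyone. I am completely new to regex in R, and I am having trouble retrieving a set of smaller patterns from the middle of a larger pattern in a tagged XML file.
Here I have a three-word sequence, "reinforce the advantages", tagged with the BNC (British National Corpus) Basic (C5) tagset. Specifically, I want to retrieve only the three lemmatized words that immediately follow each "hw=" in this long sequence.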
<w c5=VVI hw=reinforce pos=VERB>reinforce </w><w c5=AT0 hw=the pos=ART>the </w><w c5=NN2 hw=advantage pos=SUBST>advantages </w>
Could someone provide a solution using gsub or another function in R? Many thanks in advance!
NF
Answer 0 (score: 0)
vec <- "<w c5=VVI hw=reinforce pos=VERB>reinforce </w><w c5=AT0 hw=the pos=ART>the </w><w c5=NN2 hw=advantage pos=SUBST>advantages </w>"
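# Match runs of non-whitespace that directly follow "hw=" (lookbehind requires perl = TRUE)
m <- gregexpr("(?<=hw=)\\S+", vec, perl = TRUE)
# Extract the matched substrings; the result is a list with one element per input string
regmatches(vec, m)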
# [[1]]
# [1] "reinforce" "the" "advantage"
Explanation copied from regex101.com:
/(?<=hw=)\S+/
Positive Lookbehind (?<=hw=)
Assert that the Regex below matches
hw= matches the characters hw= literally (case sensitive)
\S+ matches any non-whitespace character (equal to [^\r\n\t\f\v ])
+ Quantifier — Matches between one and unlimited times, as many times as possible,
giving back as needed (greedy)
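For comparison, here is a minimal sketch of what the lookbehind buys you. If you drop it, the `hw=` prefix becomes part of every match and has to be stripped in a second step. The names `with_prefix` and `lemmas` are only illustrative and are not part of the answer above.

```r
vec <- "<w c5=VVI hw=reinforce pos=VERB>reinforce </w><w c5=AT0 hw=the pos=ART>the </w><w c5=NN2 hw=advantage pos=SUBST>advantages </w>"

# Without the lookbehind, "hw=" is consumed and ends up inside each match.
with_prefix <- regmatches(vec, gregexpr("hw=\\S+", vec, perl = TRUE))[[1]]

# Strip the literal prefix afterwards to recover the lemmas.
lemmas <- sub("^hw=", "", with_prefix)
print(lemmas)
```

This should give the same three lemmas as the lookbehind version, just in two steps instead of one.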
To get a single string, first unlist the result (see ?unlist), then collapse it (see ?paste0):
paste0(unlist(
regmatches(vec, m)
), collapse = " ")
# [1] "reinforce the advantage"
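Since the question asked about gsub, a capture-group version is also possible. This is only a sketch under an assumption taken from the sample line: every `<w>` element carries an `hw=` attribute whose value is followed by whitespace. It has not been checked against other BNC files, and `one_string` is an illustrative name.

```r
vec <- "<w c5=VVI hw=reinforce pos=VERB>reinforce </w><w c5=AT0 hw=the pos=ART>the </w><w c5=NN2 hw=advantage pos=SUBST>advantages </w>"

# Replace each whole <w ...>...</w> element with its captured hw value plus a space.
one_string <- gsub("<w [^>]*hw=(\\S+)[^>]*>[^<]*</w>", "\\1 ", vec, perl = TRUE)

# Drop the trailing space left behind by the last replacement.
one_string <- trimws(one_string)
print(one_string)
```

If `hw=` could be the last attribute before `>`, the `\\S+` group would run past the tag boundary, so something like `[^\\s>]+` may be a safer capture in that case.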