R: matching a varying number of words

Time: 2017-05-05 07:03:51

Tags: r regex

I thought my regex skills were good enough, but now I'm sitting here with no idea how to solve my problem.

First, I have a text like this:

text <- "This DEV-1231 story is about a man. He DEV-1232 is from DEV-1233 the USA. He is a university professor. He goes DEV-1234 to Nepal. He DEV-1235 climbs a mountain. The mountain is covered in ice. There is a hole in the ice. It is 22 metres deep. The man falls in it. DEV-1236 He doesn’t DEV-1237 go all the way down. He stops somewhere in the hole. He cannot move. His arm and five ribs are broken."

It contains some special, unique developer IDs:

dev_id <- "DEV-123[0-9]"

Extracting them afterwards with str_extract_all and unlist is no problem.
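For reference, the ID-only extraction can be sketched in base R without any packages. This is only an illustration: the text is a shortened, ASCII-only excerpt of the sample above, and `ids` is a name chosen for this example.

```r
# Shortened excerpt of the sample text, used here only for illustration.
text <- "This DEV-1231 story is about a man. He DEV-1232 is from DEV-1233 the USA."
dev_id <- "DEV-123[0-9]"

# gregexpr() locates every match; regmatches() returns the matched
# substrings as a list with one element per input string, so unlist()
# flattens it to a character vector.
ids <- unlist(regmatches(text, gregexpr(dev_id, text)))
print(ids)
# [1] "DEV-1231" "DEV-1232" "DEV-1233"
```

With stringr loaded, `unlist(str_extract_all(text, dev_id))` should give the same vector.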

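But I want to extract the following 30 characters or 5 words together with each ID. As you can see, sometimes there are fewer characters/words between two IDs, and that is my problem. In that case, only those 2/3/4 words should be returned.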

return
[1] DEV-1231 story is about a man.
[2] DEV-1232 is from
[3] DEV-1233 the USA. He is a
[4] DEV-1234 to Nepal. He
[5] DEV-1235 climbs a mountain. The mountain
[6] DEV-1236 He doesn't
[7] DEV-1237 go all the way down 

In this example, I was thinking of combining at most 5 words with each ID. These 5 words may include punctuation.

Thanks in advance!

1 Answer:

Answer 0 (score: 1)

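Match DEV-123[0-9], then try to match a sequence of up to five occurrences of "whitespace + non-whitespace" ((?:\s+\S+){0,5}), while using a negative lookahead to require that the "non-whitespace" part does not start with a match for the DEV-123[0-9] pattern: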

DEV-123[0-9](?:\s+(?!DEV-123[0-9])\S+){0,5}

Demo: https://regex101.com/r/AxtUkI/1
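A minimal sketch of how this pattern might be applied in R, using base functions only. The text is a shortened, ASCII-only excerpt of the question's sample, and `max_words`, `pattern` and `chunks` are names chosen for this illustration. Backslashes have to be doubled inside an R string, and `perl = TRUE` is needed because R's default regex engine does not support lookaheads.

```r
# Shortened excerpt of the sample text, used here only for illustration.
text <- "This DEV-1231 story is about a man. He DEV-1232 is from DEV-1233 the USA. He is a university professor."

dev_id <- "DEV-123[0-9]"
max_words <- 5  # hypothetical parameter: upper bound on words kept after an ID

# ID, then up to max_words repetitions of "whitespace + token", where the
# negative lookahead rejects any token that starts with another ID.
pattern <- sprintf("%s(?:\\s+(?!%s)\\S+){0,%d}", dev_id, dev_id, max_words)

# perl = TRUE switches to PCRE, which supports (?!...) lookaheads.
chunks <- unlist(regmatches(text, gregexpr(pattern, text, perl = TRUE)))
print(chunks)
# [1] "DEV-1231 story is about a man." "DEV-1232 is from"
# [3] "DEV-1233 the USA. He is a"
```

Because `\S+` matches any run of non-whitespace characters, trailing punctuation such as the period in "man." stays attached to the word. With stringr, `unlist(str_extract_all(text, pattern))` should behave the same way, since its ICU engine also supports lookaheads.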