Extracting a string from a text file

Time: 2017-10-06 05:05:28

Tags: r text-mining

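I want to extract the string between two words (start, end) in a text file, but extraction should begin after the second occurrence of start and continue up to end.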

For example, my text is

test.text <- c("During the year new factories at Haridwar for LV apparatus and at Bangalore for LV electric motors commenced production. Further increases in range and LV switchgear capacity augmentation are planned for  motors, HT motors, Drives and .")

I need to start extracting text after the second "LV" (ignoring any later occurrences, case-insensitive) and stop at "capacity".

The output should be:

electric motors commenced production. Further increases in range and

2 answers:

Answer 0 (score: 2)

We can find the positions and then apply substr:

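library(stringr)
# End position of the second "LV" match, plus 2 to skip the space after it
i1 <- str_locate_all(test.text, "LV")[[1]][2, 2] + 2
# Start position of "capacity", minus 2 to drop the space before it
i2 <- str_locate(test.text, "capacity")[[1]] - 2
# Cut out that span, then drop everything from the next " LV" onward
sub("\\sLV.*", "", substr(test.text, i1, i2))
#[1] "electric motors commenced production. Further increases in range and"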
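
The code above matches "LV" and "capacity" case-sensitively, while the question asks for a case-insensitive match. Below is a minimal base R sketch of the same locate-then-cut idea. The name extract_between and its arguments are hypothetical, and it assumes start and end are passed as regular expressions (wrapped in \\b word boundaries in the usage line). It reads "up to capacity" literally, so a later "LV" that appears before "capacity" stays in the result.

```r
# Hypothetical helper (not from the original answer): returns the text between
# the n-th match of `start` and the first match of `end` that follows it.
extract_between <- function(x, start, end, n = 2, ignore.case = TRUE) {
  # Locate every match of the start pattern
  m <- gregexpr(start, x, ignore.case = ignore.case, perl = TRUE)[[1]]
  if (m[1] == -1 || length(m) < n) return(NA_character_)
  # First character after the n-th start match
  from <- m[n] + attr(m, "match.length")[n]
  rest <- substr(x, from, nchar(x))
  # First match of the end pattern in the remaining text
  e <- regexpr(end, rest, ignore.case = ignore.case, perl = TRUE)
  if (e == -1) return(NA_character_)
  # Keep what precedes the end match and trim surrounding whitespace
  trimws(substr(rest, 1, e - 1))
}

# Shortened excerpt of the question's text, used only for illustration
excerpt <- "at Haridwar for LV apparatus and at Bangalore for LV electric motors commenced production. Further increases in range and LV switchgear capacity augmentation"
extract_between(excerpt, "\\bLV\\b", "\\bcapacity\\b")
# [1] "electric motors commenced production. Further increases in range and LV switchgear"
```

The function returns NA when there are fewer than n start matches or no end match after them, which avoids the subscript error that indexing a too-short match matrix would raise.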

Answer 1 (score: 1)

A solution using strsplit:

strsplit(test.text, "\\sLV\\s")[[1]][3]    
# [1] "electric motors commenced production. Further increases in range and"

strsplit(test.text, "\\s(LV(?!\\sswitchgear)|capacity)\\s", perl = TRUE)[[1]][3]
# [1] "electric motors commenced production. Further increases in range and LV switchgear"

The first line gives the OP's expected output. The second line gives what I think the OP actually meant.
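
If the goal is the OP's expected output with case-insensitive matching, one option is a single regular expression that skips the first n occurrences of the start word and captures up to whichever comes first, the next start word or the end word. This is only a sketch: extract_until_next is a hypothetical name, and it assumes start and end are plain words with no regex metacharacters.

```r
# Hypothetical helper (an assumption, not part of the original answers):
# capture the text after the n-th `start` word, stopping at the next
# `start` word or the `end` word, whichever appears first.
extract_until_next <- function(x, start, end, n = 2) {
  # (?is): case-insensitive, and "." also matches newlines
  pat <- sprintf(
    "(?is)^(?:.*?\\b%s\\b){%d}\\s*(.*?)\\s*\\b(?:%s|%s)\\b.*$",
    start, n, start, end
  )
  # No match means fewer than n start words, or no terminator after them
  if (!grepl(pat, x, perl = TRUE)) return(NA_character_)
  # Replace the whole string with the captured group
  sub(pat, "\\1", x, perl = TRUE)
}

# Shortened excerpt of the question's text, used only for illustration
excerpt <- "at Haridwar for LV apparatus and at Bangalore for LV electric motors commenced production. Further increases in range and LV switchgear capacity augmentation"
extract_until_next(excerpt, "LV", "capacity")
# [1] "electric motors commenced production. Further increases in range and"
```

Unlike indexing the third piece of a strsplit result, this does not depend on how many times the end word appears before the second start word.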