Removing tags from processed text in R

Time: 2016-06-08 08:43:27

Tags: r nlp

I have the following text:

txt <- c("This is a short tagging example, by John Doe.",
     "Too bad OpenNLP is so slow on large texts.")

and this extraction code:

library(NLP)
library(openNLP)

extractPOS <- function(x, thisPOSregex) {
  x <- as.String(x)
  # Sentence and word annotations are needed before POS tagging
  wordAnnotation <- annotate(x, list(Maxent_Sent_Token_Annotator(), Maxent_Word_Token_Annotator()))
  POSAnnotation <- annotate(x, Maxent_POS_Tag_Annotator(), wordAnnotation)
  POSwords <- subset(POSAnnotation, type == "word")
  tags <- sapply(POSwords$features, '[[', "POS")
  # Keep only the tokens whose tag matches the regex
  thisPOSindex <- grep(thisPOSregex, tags)
  tokenizedAndTagged <- sprintf("%s/%s", x[POSwords][thisPOSindex], tags[thisPOSindex])
  untokenizedAndTagged <- paste(tokenizedAndTagged, collapse = " ")
  untokenizedAndTagged
}

Noun = lapply(txt, extractPOS, "NN")

This extracts all the nouns, each with its tag:

[[1]]
[1] "tagging/NN example/NN John/NNP Doe/NNP"

[[2]]
[1] "OpenNLP/NNP texts/NNS"

How can I parse this output to get the plain text without the tags?

tagging,example,John,Doe,OpenNLP,texts

1 answer:

Answer 0 (score: 1)

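We can use str_extract_all() from stringr with a lookahead, so that only the word characters immediately before a slash are matched:

library(stringr)
unlist(str_extract_all(unlist(Noun), "\\w+(?=\\/)"))
#[1] "tagging" "example" "John" "Doe" "OpenNLP" "texts"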
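The same lookahead idea can probably be reproduced in base R without stringr. The following is a minimal sketch, assuming the tagged strings have the word/TAG shape shown in the question; the variable names tagged, matches and words are illustrative only. In the pattern, \\w+ matches a run of word characters and (?=/) requires a slash to follow without consuming it, which needs perl = TRUE in base R.

```r
# Sample data copied from the question's output (word/TAG tokens separated by spaces).
Noun <- list("tagging/NN example/NN John/NNP Doe/NNP", "OpenNLP/NNP texts/NNS")

# Flatten the list into a character vector, one tagged string per input text.
tagged <- unlist(Noun)

# Locate every run of word characters that is directly followed by a slash.
# perl = TRUE enables the (?=...) lookahead syntax.
matches <- gregexpr("\\w+(?=/)", tagged, perl = TRUE)

# Pull out the matched substrings and flatten them into one vector.
words <- unlist(regmatches(tagged, matches))

print(words)
```

Because \\w does not match hyphens or apostrophes, a token such as well-known/JJ would come back as known with this pattern, so it is worth checking against real tagger output.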

Data:

Noun <- list("tagging/NN example/NN John/NNP Doe/NNP", "OpenNLP/NNP texts/NNS")
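The question asks for a single comma-separated string rather than a character vector. One possible way to get there, sketched under the assumption that tokens are separated by whitespace and that the tag is whatever follows the last slash, is to split each string and strip the tag with sub(). The helper name strip_tags is hypothetical and not part of any package.

```r
# Sample data copied from the question's output.
Noun <- list("tagging/NN example/NN John/NNP Doe/NNP", "OpenNLP/NNP texts/NNS")

# strip_tags is a hypothetical helper: it takes one tagged string,
# splits it on whitespace, and removes the final "/TAG" from each token.
strip_tags <- function(tagged) {
  tokens <- strsplit(tagged, "\\s+")[[1]]
  sub("/[^/]*$", "", tokens)
}

# Keep the words grouped per input text ...
words_by_text <- lapply(Noun, strip_tags)

# ... or collapse everything into the comma-separated form from the question.
joined <- paste(unlist(words_by_text), collapse = ",")

print(words_by_text)
print(joined)
```

Removing only the text after the last slash keeps hyphenated or apostrophized words intact, at the cost of assuming the tags themselves never contain a slash.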
