I have the following text:
txt <- c("This is a short tagging example, by John Doe.",
"Too bad OpenNLP is so slow on large texts.")
Extraction code:
library(NLP)      # as.String(), annotate()
library(openNLP)  # Maxent_* annotators
extractPOS <- function(x, thisPOSregex) {
  x <- as.String(x)
  # Sentence and word tokenization
  wordAnnotation <- annotate(x, list(Maxent_Sent_Token_Annotator(), Maxent_Word_Token_Annotator()))
  # POS tagging on top of the token annotations
  POSAnnotation <- annotate(x, Maxent_POS_Tag_Annotator(), wordAnnotation)
  POSwords <- subset(POSAnnotation, type == "word")
  tags <- sapply(POSwords$features, '[[', "POS")
  # Keep only the tokens whose tag matches the regex
  thisPOSindex <- grep(thisPOSregex, tags)
  tokenizedAndTagged <- sprintf("%s/%s", x[POSwords][thisPOSindex], tags[thisPOSindex])
  paste(tokenizedAndTagged, collapse = " ")
}
Noun <- lapply(txt, extractPOS, "NN")
This extracts all the nouns:
[[1]]
[1] "tagging/NN example/NN John/NNP Doe/NNP"
[[2]]
[1] "OpenNLP/NNP texts/NNS"
How can I parse this output to get the plain words without the tags?
tagging,example,John,Doe,OpenNLP,texts
Answer 0 (score: 1)
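We can use str_extract_all() from the stringr package. The lookahead (?=\\/) matches each run of word characters that is immediately followed by a slash, so the tag itself is never captured:
library(stringr)
unlist(str_extract_all(unlist(Noun), "\\w+(?=\\/)"))
#[1] "tagging" "example" "John" "Doe" "OpenNLP" "texts"
where Noun is the list shown in the question:
Noun <- list("tagging/NN example/NN John/NNP Doe/NNP", "OpenNLP/NNP texts/NNS")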
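If you would rather not add a dependency, something along these lines should also work in base R. This is only a sketch: it assumes every token has the form word/TAG and that tokens are separated by single spaces, as in the output above. The names tokens, words and result are just illustrative. It also collapses the words into the single comma-separated string the question asks for.

```r
# Same input as the output shown in the question
Noun <- list("tagging/NN example/NN John/NNP Doe/NNP", "OpenNLP/NNP texts/NNS")

# Split each string into "word/TAG" tokens on single spaces
tokens <- unlist(strsplit(unlist(Noun), " ", fixed = TRUE))

# Remove the last "/" and everything after it (the POS tag)
words <- sub("/[^/]*$", "", tokens)

# Join the words into one comma-separated string
result <- paste(words, collapse = ",")
result
#[1] "tagging,example,John,Doe,OpenNLP,texts"
```

Unlike the \\w+ pattern, this keeps everything before the final slash, so a hyphenated token would stay in one piece.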