I have a sentence like this:
sent <- "She likes long walks on the beach with her dogs."
Let's say I tokenize it word by word. What NLP tools can I use to get information about the pronouns in this sentence, such as person (first, second, or third) and type (possessive, reflexive, etc.)?
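For concreteness, the word-by-word tokenization could look like the sketch below (using the tokenizers package, which the answer below also installs; any tokenizer would do):

library(tokenizers)
# Split the sentence into word tokens (lowercased, punctuation dropped by default)
tokenize_words(sent)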
Answer (score: 1):
Short answer: you will have to implement additional (appropriate) heuristics yourself. For example, a quick-and-dirty way to detect SUBJECT-VERB-OBJECT patterns is to search for NOUN-VERB-NOUN (or PRONOUN-VERB-NOUN) triplets, as suggested in Extract triplet subject, predicate, and object sentence. I am not sure whether there is a high-level NLP package in R that does this reliably.
On your data, first create POS tags following http://smart-statistics.com/part-speech-tagging-r/ (any POS-tagging package will do):
library(devtools)
# RDRPOSTagger (bnosac) wraps the Ripple Down Rules-based POS tagger
devtools::install_github("bnosac/RDRPOSTagger")
library(RDRPOSTagger)
# tokenizers is used to split the text into tokens
devtools::install_github("ropensci/tokenizers")
library(tokenizers)
Then tag your data:
sent <- "She likes long walks on the beach with her dogs."
# Load the pretrained English model that produces Universal POS tags
unipostagger <- rdr_model(language = "English", annotation = "UniversalPOS")
# Tag each token in the sentence
pos <- rdr_pos(unipostagger, sent)
> pos
doc_id token_id token pos
1 d1 1 She PRON
2 d1 2 likes VERB
3 d1 3 long ADJ
4 d1 4 walks VERB
5 d1 5 on ADP
6 d1 6 the DET
7 d1 7 beach NOUN
8 d1 8 with ADP
9 d1 9 her PRON
10 d1 10 dogs NOUN
11 d1 11 . PUNCT
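As a side note on the original question: at this point the pronouns can already be pulled out of the tagged table, although the UniversalPOS tag only tells you that a word is a pronoun, not its person or possessive/reflexive type. A minimal base-R sketch on the pos data frame above:

# Keep only the pronoun rows (here the tokens "She" and "her")
pos[pos$pos == "PRON", ]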
Then extract the pattern (the snippet below uses dplyr):
> library(dplyr)
> subj <- pos %>% filter(grepl("PRON|NOUN",pos)) %>% select(token) %>% slice(1)
> verb <- pos %>% filter(grepl("VERB",pos)) %>% select(token) %>% slice(1)
> obj <- pos %>% filter(grepl("PRON|NOUN",pos)) %>% select(token) %>% slice(n())
> paste(subj, verb, obj)
[1] "She likes dogs"
Obviously, how well this works depends on the complexity of the sentence.
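For the pronoun features the question actually asks about (person, possessive/reflexive), one option is the udpipe package: its annotation returns a feats column with Universal Dependencies morphological features such as Person=3, Poss=Yes and PronType=Prs. A minimal sketch, assuming its pretrained English model is acceptable (model names can vary by udpipe version):

library(udpipe)
# Download and load the pretrained English model (only needed once)
m  <- udpipe_download_model(language = "english")
ud <- udpipe_load_model(m$file_model)

# Annotate the sentence; the feats column holds morphological features per token
ann <- as.data.frame(udpipe_annotate(ud, x = sent))
ann[ann$upos == "PRON", c("token", "feats")]
# e.g. "her" should carry Person=3 and Poss=Yes in its feats string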