How can I change the default settings so that the hashtag symbol stays attached to its word (i.e. I get #company rather than # and company)?
library(udpipe)
library(data.table)
# Load the model once; loading it again from x_mod$file is redundant
ud_model <- udpipe_load_model("D:/Users/asongara/Documents/english-ewt-ud-2.3-181115.udpipe")
anno_op3 <- udpipe_annotate(ud_model,
                            x = "This is a better #company than i thought @mr_jones!",
                            tokenizer = "tokenizer",
                            tagger = "default",
                            trace = TRUE)
# Convert the annotation to a data.table for inspection
anno_op3 <- as.data.table(as.data.frame(anno_op3))
View(anno_op3)
I get # and company as two separate tokens. I want #company as a single token. Note that @mr_jones already comes through as a single token.
Answer 0 (score: 0)
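You can combine other tokenization tools with the udpipe R package, as shown at https://bnosac.github.io/udpipe/docs/doc2.html. For example, the code below uses a tokenizer designed for Twitter messages and then uses udpipe for part-of-speech tagging, morphological feature annotation, and dependency parsing.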
library(tokenizers)
library(udpipe)
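# Tokenize with a Twitter-aware tokenizer that keeps #hashtags and @mentions intact
x <- tokenize_tweets(c("#rstats is a programming_language", "you can combine the #tokenizers package with @udpipe parsing"),
                     lowercase = FALSE, strip_punct = FALSE)
# Put one token per line, the input layout expected by tokenizer = "vertical"
x <- sapply(x, FUN = function(tokens) paste(tokens, collapse = "\n"))
# Tag, lemmatize, and parse the pre-tokenized text
x <- udpipe(x, "english-ewt", tokenizer = "vertical", trace = TRUE)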
x
doc_id paragraph_id sentence_id sentence start end term_id token_id token lemma upos xpos feats head_token_id dep_rel deps misc
doc1 1 1 <NA> 1 7 1 1 #rstats #rstat PRON PRP$ Gender=Neut|Number=Sing|Person=3|Poss=Yes|PronType=Prs 4 nsubj <NA> <NA>
doc1 1 1 <NA> 9 10 2 2 is be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 4 cop <NA> <NA>
doc1 1 1 <NA> 12 12 3 3 a a DET DT Definite=Ind|PronType=Art 4 det <NA> <NA>
doc1 1 1 <NA> 14 33 4 4 programming_language programming_language NOUN NN Number=Sing 0 root <NA> <NA>
doc2 1 1 <NA> 1 3 1 1 you you PRON PRP Case=Nom|Person=2|PronType=Prs 3 nsubj <NA> <NA>
doc2 1 1 <NA> 5 7 2 2 can can AUX MD VerbForm=Fin 3 aux <NA> <NA>
doc2 1 1 <NA> 9 15 3 3 combine combine VERB VB VerbForm=Inf 0 root <NA> <NA>
doc2 1 1 <NA> 17 19 4 4 the the DET DT Definite=Def|PronType=Art 6 det <NA> <NA>
doc2 1 1 <NA> 21 31 5 5 #tokenizers #tokenizer NOUN NNS Number=Plur 6 compound <NA> <NA>
doc2 1 1 <NA> 33 39 6 6 package package NOUN NN Number=Sing 3 obj <NA> <NA>
doc2 1 1 <NA> 41 44 7 7 with with ADP IN <NA> 9 case <NA> <NA>
doc2 1 1 <NA> 46 52 8 8 @udpipe @udpipe NOUN NN Number=Sing 9 compound <NA> <NA>
doc2 1 1 <NA> 54 60 9 9 parsing parsing NOUN NN Number=Sing 6 nmod <NA> <NA>
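If the tokenizers package is not an option, the same idea (tokenize yourself, then hand udpipe one token per line) can be approximated with a base-R regular expression. This is only a minimal sketch: `tokenize_keep_tags` is a hypothetical helper, not part of udpipe or tokenizers, and its pattern is a simplification that splits contractions such as "don't" into separate pieces. The final udpipe call is left commented out and assumes the package is installed and the "english-ewt" model can be downloaded.

```r
# Hypothetical helper (not part of udpipe or tokenizers): split text into
# tokens while keeping a leading # or @ attached to the word that follows it.
tokenize_keep_tags <- function(text) {
  # Alternatives are tried left to right at each position:
  #   1. a hashtag or mention, 2. a plain word, 3. any other single
  #   non-space character (punctuation).
  pattern <- "[#@][[:alnum:]_]+|[[:alnum:]_]+|[^[:space:]]"
  regmatches(text, gregexpr(pattern, text, perl = TRUE))
}

txt <- "This is a better #company than i thought @mr_jones!"
tokens <- tokenize_keep_tags(txt)
print(tokens[[1]])

# One token per line is the layout expected by tokenizer = "vertical".
vertical <- vapply(tokens, paste, character(1), collapse = "\n")

# Assumption: udpipe is installed and the model is available.
# library(udpipe)
# anno <- udpipe(vertical, "english-ewt", tokenizer = "vertical")
```

With the example sentence, #company and @mr_jones each stay a single token and the trailing ! becomes a token of its own.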