How to keep a hashtag and its word as a single token

Time: 2018-12-21 12:20:20

Tags: r token udpipe

How can I change the default settings so that the hashtag symbol stays attached to its word (i.e. #company as one token rather than # and company as two)?

library(udpipe)
library(data.table)

# Load the model once; udpipe_load_model() already returns the usable model object
ud_model <- udpipe_load_model("D:/Users/asongara/Documents/english-ewt-ud-2.3-181115.udpipe")
anno_op3 <- udpipe_annotate(ud_model, 
                            "This is a better #company than i thought @mr_jones!", 
                            tokenizer = "tokenizer", 
                            tagger = "default", 
                            trace = TRUE)

anno_op3 <- as.data.table(as.data.frame(anno_op3))

View(anno_op3)

I get # and company as two separate tokens, but I want #company as a single token. @mr_jones, on the other hand, already comes out as a single token.

1 Answer:

Answer 0 (score: 0)

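You can combine other tokenization tools with the udpipe R package, as shown at https://bnosac.github.io/udpipe/docs/doc2.html. For example, the code below uses a tokenizer designed for Twitter messages and then lets udpipe do the part-of-speech tagging, morphological feature annotation, and dependency parsing. Setting tokenizer = "vertical" tells udpipe that the input is already tokenized, with one token per line.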

library(tokenizers)
library(udpipe)
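# Tokenize with a Twitter-aware tokenizer that keeps #hashtags and @mentions whole
x <- tokenize_tweets(c("#rstats is a programming_language", "you can combine the #tokenizers package with @udpipe parsing"), 
                     lowercase = FALSE, strip_punct = FALSE)
# Put one token per line, which is the "vertical" input format udpipe expects
x <- sapply(x, FUN=function(x) paste(x, collapse="\n"))
# Skip udpipe's own tokenizer; only tag, lemmatize and parse the given tokens
x <- udpipe(x, "english-ewt", tokenizer = "vertical", trace = TRUE)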
x
 doc_id paragraph_id sentence_id sentence start end term_id token_id                token                lemma upos xpos                                                  feats head_token_id  dep_rel deps misc
   doc1            1           1     <NA>     1   7       1        1              #rstats               #rstat PRON PRP$ Gender=Neut|Number=Sing|Person=3|Poss=Yes|PronType=Prs             4    nsubj <NA> <NA>
   doc1            1           1     <NA>     9  10       2        2                   is                   be  AUX  VBZ  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin             4      cop <NA> <NA>
   doc1            1           1     <NA>    12  12       3        3                    a                    a  DET   DT                              Definite=Ind|PronType=Art             4      det <NA> <NA>
   doc1            1           1     <NA>    14  33       4        4 programming_language programming_language NOUN   NN                                            Number=Sing             0     root <NA> <NA>
   doc2            1           1     <NA>     1   3       1        1                  you                  you PRON  PRP                         Case=Nom|Person=2|PronType=Prs             3    nsubj <NA> <NA>
   doc2            1           1     <NA>     5   7       2        2                  can                  can  AUX   MD                                           VerbForm=Fin             3      aux <NA> <NA>
   doc2            1           1     <NA>     9  15       3        3              combine              combine VERB   VB                                           VerbForm=Inf             0     root <NA> <NA>
   doc2            1           1     <NA>    17  19       4        4                  the                  the  DET   DT                              Definite=Def|PronType=Art             6      det <NA> <NA>
   doc2            1           1     <NA>    21  31       5        5          #tokenizers           #tokenizer NOUN  NNS                                            Number=Plur             6 compound <NA> <NA>
   doc2            1           1     <NA>    33  39       6        6              package              package NOUN   NN                                            Number=Sing             3      obj <NA> <NA>
   doc2            1           1     <NA>    41  44       7        7                 with                 with  ADP   IN                                                   <NA>             9     case <NA> <NA>
   doc2            1           1     <NA>    46  52       8        8              @udpipe              @udpipe NOUN   NN                                            Number=Sing             9 compound <NA> <NA>
   doc2            1           1     <NA>    54  60       9        9              parsing              parsing NOUN   NN                                            Number=Sing             6     nmod <NA> <NA>
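If tokenize_tweets() is not available in your installed version of the tokenizers package, the same idea can be sketched in base R. The helper tokenize_keep_tags() below is a hypothetical name for this illustration and is not part of udpipe or tokenizers. Its regular expression is a deliberately simple assumption: it keeps #hashtags and @mentions whole, keeps runs of letters, digits and underscores together, and treats every other non-space character as its own token, so contractions such as "don't" would be split.

```r
# Hypothetical helper (not from udpipe or tokenizers): a minimal regex tokenizer.
# Alternatives are tried left to right, so "#company" matches as a whole
# before "#" could be picked up as a lone symbol.
tokenize_keep_tags <- function(text) {
  pattern <- "[#@][[:alnum:]_]+|[[:alnum:]_]+|[^[:space:]]"
  regmatches(text, gregexpr(pattern, text, perl = TRUE))
}

txt <- "This is a better #company than i thought @mr_jones!"
tokens <- tokenize_keep_tags(txt)
print(tokens[[1]])

# One token per line: the "vertical" format for pre-tokenized input
vertical <- sapply(tokens, paste, collapse = "\n")

# With the model from the question loaded as ud_model, the pre-tokenized
# text could then be annotated without udpipe's own tokenizer:
# anno <- udpipe_annotate(ud_model, x = vertical, tokenizer = "vertical")
# as.data.frame(anno)
```

For the sentence from the question this yields ten tokens, with #company and @mr_jones each kept as one token and the trailing "!" separated.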