我的文字如下。
Section <- c("If an infusion reaction occurs, interrupt the infusion.")
df <- data.frame(Section)
当我使用tidytext和下面的代码标记时,
AA <- df %>%
mutate(tokens = str_extract_all(df$Section, "([^\\s]+)"),
locations = str_locate_all(df$Section, "([^\\s]+)"),
locations = map(locations, as.data.frame)) %>%
select(-Section) %>%
unnest(tokens, locations)
它给了我令牌,开始和结束位置。如何在同时取消嵌套的同时获取POS标签。如下所示(下图中的POS标签可能不正确)
答案 0 :(得分:2)
您可以使用udpipe软件包获取POS数据。 Udpipe自动将标点符号化。
@for(var i = 0; i < 5; i++)
Your code goes here using thing[i]
现在将其合并为您要求的previous question的结果。我回答了其中一个。
Section <- c("If an infusion reaction occurs, interrupt the infusion.")
df <- data.frame(Section, stringAsFactors = FALSE)
library(udpipe)
library(dplyr)
udmodel <- udpipe_download_model(language = "english")
udmodel <- udpipe_load_model(file = udmodel$file_model)
x <- udpipe_annotate(udmodel,
df$Section)
x <- as.data.frame(x)
x %>% select(token, upos)
token upos
1 If SCONJ
2 an DET
3 infusion NOUN
4 reaction NOUN
5 occurs NOUN
6 , PUNCT
7 interrupt VERB
8 the DET
9 infusion NOUN
10 . PUNCT
编辑:更好的解决方案
但是最好的解决方案是从udpipe开始,然后再进行其余操作。请注意,我使用的是stringi而不是stringr包。 stringr基于stringi,但是stringi具有更多选项。
library(stringr)
library(purrr)
library(tidyr)
df %>% mutate(
tokens = str_extract_all(Section, "\\w+|[[:punct:]]"),
locations = str_locate_all(Section, "\\w+|[[:punct:]]"),
locations = map(locations, as.data.frame)) %>%
select(-Section) %>%
unnest(tokens, locations) %>%
mutate(POS = purrr::map_chr(tokens, function(x) as.data.frame(udpipe_annotate(udmodel, x = x, tokenizer = "vertical"))$upos))
tokens start end upos
1 If 1 2 SCONJ
2 an 4 5 DET
3 infusion 7 14 NOUN
4 reaction 16 23 NOUN
5 occurs 25 30 NOUN
6 , 31 31 PUNCT
7 interrupt 33 41 VERB
8 the 43 45 DET
9 infusion 47 54 NOUN
10 . 55 55 PUNCT
答案 1 :(得分:2)
仅供参考。从CRAN的udpipe版本0.7开始,您可以执行以下操作。
library(udpipe)
x <- data.frame(doc_id = c("doc1", "doc2"),
text = c("If an infusion reaction occurs, interrupt the infusion.",
"Houston we have a problem"))
x <- udpipe(x, "english")
x
为您提供(注意您要查找的开始/结束以及令牌/ upos / xpos):
doc_id paragraph_id sentence_id start end term_id token_id token lemma upos xpos feats head_token_id dep_rel deps misc
doc1 1 1 1 2 1 1 If if SCONJ IN <NA> 7 mark <NA> <NA>
doc1 1 1 4 5 2 2 an a DET DT Definite=Ind|PronType=Art 5 det <NA> <NA>
doc1 1 1 7 14 3 3 infusion infusion NOUN NN Number=Sing 4 compound <NA> <NA>
doc1 1 1 16 23 4 4 reaction reaction NOUN NN Number=Sing 5 compound <NA> <NA>
doc1 1 1 25 30 5 5 occurs occur NOUN NNS Number=Plur 7 nsubj <NA> SpaceAfter=No
doc1 1 1 31 31 6 6 , , PUNCT , <NA> 7 punct <NA> <NA>
doc1 1 1 33 41 7 7 interrupt interrupt VERB VB Mood=Imp|VerbForm=Fin 0 root <NA> <NA>
doc1 1 1 43 45 8 8 the the DET DT Definite=Def|PronType=Art 9 det <NA> <NA>
doc1 1 1 47 54 9 9 infusion infusion NOUN NN Number=Sing 7 obj <NA> SpaceAfter=No
doc1 1 1 55 55 10 10 . . PUNCT . <NA> 7 punct <NA> SpacesAfter=\\n
doc2 1 1 1 7 1 1 Houston Houston PROPN NNP Number=Sing 0 root <NA> <NA>
doc2 1 1 9 10 2 2 we we PRON PRP Case=Nom|Number=Plur|Person=1|PronType=Prs 3 nsubj <NA> <NA>
doc2 1 1 12 15 3 3 have have VERB VBP Mood=Ind|Tense=Pres|VerbForm=Fin 1 parataxis <NA> <NA>
doc2 1 1 17 17 4 4 a a DET DT Definite=Ind|PronType=Art 5 det <NA> <NA>
doc2 1 1 19 25 5 5 problem problem NOUN NN Number=Sing 3 obj <NA> SpacesAfter=\\n
答案 2 :(得分:1)
就像上一个回答者一样,我认为udpipe可能是进行POS标记的最简单方法。我最喜欢的与udpipe交互的方式是通过cleanNLP包。调用初始化函数后,只需两行代码即可获得udpipe输出。
library(tidyverse)
library(cleanNLP)
cnlp_init_udpipe()
#> Loading required namespace: udpipe
df <- data_frame(id = 1,
text = c("If an infusion reaction occurs, interrupt the infusion."))
cnlp_annotate(df) %>%
cnlp_get_tif()
#> # A tibble: 10 x 19
#> id sid tid word lemma upos pos cid pid definite mood
#> <chr> <int> <int> <chr> <chr> <chr> <chr> <dbl> <int> <chr> <chr>
#> 1 1 1 1 If if SCONJ IN 0 1 <NA> <NA>
#> 2 1 1 2 an a DET DT 3 1 Ind <NA>
#> 3 1 1 3 infu… infu… NOUN NN 6 1 <NA> <NA>
#> 4 1 1 4 reac… reac… NOUN NN 15 1 <NA> <NA>
#> 5 1 1 5 occu… occur NOUN NNS 24 1 <NA> <NA>
#> 6 1 1 6 , , PUNCT , 30 1 <NA> <NA>
#> 7 1 1 7 inte… inte… VERB VB 32 1 <NA> Imp
#> 8 1 1 8 the the DET DT 42 1 Def <NA>
#> 9 1 1 9 infu… infu… NOUN NN 46 1 <NA> <NA>
#> 10 1 1 10 . . PUNCT . 54 1 <NA> <NA>
#> # ... with 8 more variables: number <chr>, pron_type <chr>,
#> # verb_form <chr>, source <int>, relation <chr>, word_source <chr>,
#> # lemma_source <chr>, spaces <dbl>
由reprex package(v0.2.0)于2018-08-15创建。