Tokenization question

Date: 2018-08-14 22:34:51

Tags: r regex tokenize tidytext

I am trying to tokenize a sentence as follows.

Section <- c("If an infusion reaction occurs, interrupt the infusion.")
df <- data.frame(Section)

When I tokenize it with tidytext using the code below,

AA <- df %>%
  mutate(tokens = str_extract_all(df$Section, "([^\\s]+)"),
         locations = str_locate_all(df$Section, "([^\\s]+)"),
         locations = map(locations, as.data.frame)) %>%
  select(-Section) %>%
  unnest(tokens, locations) 

it gives me a result set like this (see image).

[Image: R output for AA]

How can I get the comma and the period as their own tokens, rather than as part of "occurs" and "infusion" respectively, using tidytext? My tokens should be:

If
an
infusion
reaction
occurs
,
interrupt
the
infusion
.

4 answers:

Answer 0 (score: 1)

Replace them beforehand, making sure to prepend a space before each replacement. Then split the sentence on spaces.

include = c(".", ",") #The symbols that should be included

mystr = Section  # copy data
for (mypattern in include){
    mystr = gsub(pattern = mypattern,
                 replacement = paste0(" ", mypattern),
                 x = mystr, fixed = TRUE)
}
lapply(strsplit(mystr, " "), function(V) data.frame(Tokens = V))
#[[1]]
#      Tokens
#1         If
#2         an
#3   infusion
#4   reaction
#5     occurs
#6          ,
#7  interrupt
#8        the
#9   infusion
#10         .

Answer 1 (score: 0)

Note that this ends up increasing the length of your string, so the start/end positions refer to the padded string (with end one past the token's last character):

df %>%
  mutate(Section = gsub("([,.])", ' \\1', Section),
         start = gregexpr("\\S+", Section),
         end = list(attr(start[[1]], "match.length") + unlist(start)),
         Section = strsplit(Section, "\\s+")) %>%
  unnest()

     Section start end
1         If     1   3
2         an     4   6
3   infusion     7  15
4   reaction    16  24
5     occurs    25  31
6          ,    32  33
7  interrupt    34  43
8        the    44  47
9   infusion    48  56
10         .    57  58

Answer 2 (score: 0)

Here is an approach that does not require replacing anything first. The trick is the [[:punct:]] character class, which matches any of the following:

!"#$%&'()*+,\-./:;<=>?@[\]^_`{|}~

The pattern is simply \\w+|[[:punct:]], which says: match a run of word characters, or a single punctuation character. str_extract_all takes care of the rest, pulling each match out separately. If you only want to split off specific punctuation marks, you can use \\w+|[,.] or similar instead.

AA <- df %>% mutate(
     tokens = str_extract_all(Section, "\\w+|[[:punct:]]"),
     locations = str_locate_all(Section, "\\w+|[[:punct:]]"),
     locations = map(locations, as.data.frame)) %>%
  select(-Section) %>%
  unnest(tokens, locations)

      tokens start end
1         If     1   2
2         an     4   5
3   infusion     7  14
4   reaction    16  23
5     occurs    25  30
6          ,    31  31
7  interrupt    33  41
8        the    43  45
9   infusion    47  54
10         .    55  55
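For the narrower case mentioned above, here is a minimal sketch restricting the pattern to just commas and periods, reusing the same pipeline (note: this assumes the tidyr < 1.0 unnest() syntax used in this answer):

```r
library(dplyr)
library(stringr)
library(purrr)
library(tidyr)

df <- data.frame(Section = "If an infusion reaction occurs, interrupt the infusion.",
                 stringsAsFactors = FALSE)

# Same idea as above, but only , and . become standalone tokens
BB <- df %>% mutate(
     tokens = str_extract_all(Section, "\\w+|[,.]"),
     locations = str_locate_all(Section, "\\w+|[,.]"),
     locations = map(locations, as.data.frame)) %>%
  select(-Section) %>%
  unnest(tokens, locations)
```

For this sentence the result is identical to the [[:punct:]] version, since the only punctuation present is a comma and a period; the two patterns diverge on input containing other marks such as hyphens or apostrophes.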

Answer 3 (score: 0)

The unnest_tokens() function has a strip_punct argument for tokenizers that support it, such as the word tokenizer.

library(tidyverse)
library(tidytext)

df %>%
  unnest_tokens(word, Section, strip_punct = FALSE)
#> # A tibble: 10 x 1
#>    word     
#>    <chr>    
#>  1 if       
#>  2 an       
#>  3 infusion 
#>  4 reaction 
#>  5 occurs   
#>  6 ,        
#>  7 interrupt
#>  8 the      
#>  9 infusion 
#> 10 .
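One thing to notice in the output above: unnest_tokens() lowercases tokens by default ("If" became "if"). If the original casing matters, it also accepts a to_lower argument:

```r
library(dplyr)
library(tidytext)

df <- data.frame(Section = "If an infusion reaction occurs, interrupt the infusion.",
                 stringsAsFactors = FALSE)

# to_lower = FALSE keeps the original capitalization of each token
out <- df %>%
  unnest_tokens(word, Section, strip_punct = FALSE, to_lower = FALSE)
```

With to_lower = FALSE the first token comes back as "If" rather than "if", while the comma and period are still emitted as separate tokens.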

Created on 2018-08-15 by the reprex package (v0.2.0).