I am trying to tokenize a sentence as follows.
Section <- c("If an infusion reaction occurs, interrupt the infusion.")
df <- data.frame(Section)
When I tokenize it using tidytext and the code below,
library(tidyverse)
AA <- df %>%
  mutate(tokens = str_extract_all(Section, "([^\\s]+)"),
         locations = str_locate_all(Section, "([^\\s]+)"),
         locations = map(locations, as.data.frame)) %>%
  select(-Section) %>%
  unnest(cols = c(tokens, locations))
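it gives me a result set like the one shown below (see figure).
How can I get the comma and the period as separate tokens, rather than as part of "occurs," and "infusion." respectively, using tidytext? My tokens should be: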
If
an
infusion
reaction
occurs
,
interrupt
the
infusion
.
Answer 0: (score: 1)
Pad the punctuation beforehand: replace each symbol with itself preceded by a space, then split the sentence at the spaces.
include = c(".", ",") #The symbols that should be included
mystr = Section # copy data
for (mypattern in include){
mystr = gsub(pattern = mypattern,
replacement = paste0(" ", mypattern),
x = mystr, fixed = TRUE)
}
lapply(strsplit(mystr, " "), function(V) data.frame(Tokens = V))
#[[1]]
# Tokens
#1 If
#2 an
#3 infusion
#4 reaction
#5 occurs
#6 ,
#7 interrupt
#8 the
#9 infusion
#10 .
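The same idea can be wrapped in a small helper so it runs over several sentences at once. This is only a sketch: the function name `tokenize_keep_punct` and the second test sentence are made up for illustration. It splits on `\\s+` instead of a single space, so a symbol that was already preceded by a space does not produce an empty token.

```r
# Hypothetical helper (not from the original answer): pad each symbol with a
# leading space, then split on runs of whitespace.
tokenize_keep_punct <- function(x, include = c(".", ",")) {
  for (sym in include) {
    # fixed = TRUE treats "." literally rather than as a regex wildcard
    x <- gsub(sym, paste0(" ", sym), x, fixed = TRUE)
  }
  # trimws() avoids a leading empty token when a string starts with a symbol
  strsplit(trimws(x), "\\s+")
}

res <- tokenize_keep_punct(c(
  "If an infusion reaction occurs, interrupt the infusion.",
  "Stop , then wait."
))
res[[2]]
```

Keep in mind that padding "." this way also splits decimals such as "3.5" into "3" and ".5", so the symbol list may need adjusting for numeric text.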
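Answer 1: (score: 0)
Note that this ends up increasing the length of your string, because a space is inserted before each punctuation mark, so the positions below refer to the modified string: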
df %>%
  mutate(Section = gsub("([,.])", " \\1", Section),
         start = gregexpr("\\S+", Section),
         end = list(attr(start[[1]], "match.length") + unlist(start)),
         Section = strsplit(Section, "\\s+")) %>%
  unnest(cols = c(Section, start, end))
Section start end
1 If 1 3
2 an 4 6
3 infusion 7 15
4 reaction 16 24
5 occurs 25 31
6 , 32 33
7 interrupt 34 43
8 the 44 47
9 infusion 48 56
10 . 57 58
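If you need offsets into the original string instead, one possible workaround is to subtract the number of spaces inserted up to each token. The base R sketch below assumes the same `[,.]` padding as above; the variable names (`padded`, `start_orig`, and so on) are my own. After padding, every comma or period begins a token, so a cumulative count of such tokens gives the shift.

```r
Section <- "If an infusion reaction occurs, interrupt the infusion."
padded  <- gsub("([,.])", " \\1", Section)

nchar(Section)  # length of the original string
nchar(padded)   # longer by one per padded symbol

g            <- gregexpr("\\S+", padded)
tokens       <- regmatches(padded, g)[[1]]
start_padded <- as.integer(g[[1]])

# Each token that begins with "," or "." had one space inserted before it
shift      <- cumsum(grepl("^[,.]", tokens))
start_orig <- start_padded - shift
end_orig   <- start_orig + nchar(tokens) - 1L  # inclusive end position

data.frame(tokens, start = start_orig, end = end_orig)
```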
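Answer 2: (score: 0)
Here is a way to do it without replacing anything first. The trick is to use the [[:punct:]] character class, which matches any one of the following:
!"#$%&'()*+,\-./:;<=>?@[\]^_`{|}~
The pattern is simply \\w+|[[:punct:]], which says: match either a run of consecutive word characters or a single punctuation character. str_extract_all takes care of the rest, pulling each match out separately. If you only want to split off specific punctuation marks, you can use \\w+|[,.] or something similar instead.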
AA <- df %>% mutate(
    tokens = str_extract_all(Section, "\\w+|[[:punct:]]"),
    locations = str_locate_all(Section, "\\w+|[[:punct:]]"),
    locations = map(locations, as.data.frame)) %>%
  select(-Section) %>%
  unnest(cols = c(tokens, locations))
tokens start end
1 If 1 2
2 an 4 5
3 infusion 7 14
4 reaction 16 23
5 occurs 25 30
6 , 31 31
7 interrupt 33 41
8 the 43 45
9 infusion 47 54
10 . 55 55
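For anyone who wants to avoid the stringr dependency, roughly the same result can be had with base R's gregexpr() and regmatches(). This is a sketch under that assumption, not the answer's own code, and the variable names are mine. As far as I know, the base R regex engine and stringr's ICU engine do not agree on every character in [[:punct:]] (symbols such as + or $ may be treated differently), so check the class against your own data.

```r
Section <- "If an infusion reaction occurs, interrupt the infusion."
pattern <- "\\w+|[[:punct:]]"

m      <- gregexpr(pattern, Section)
tokens <- regmatches(Section, m)[[1]]

# gregexpr() reports start positions plus a "match.length" attribute
start <- as.integer(m[[1]])
end   <- start + attr(m[[1]], "match.length") - 1L  # inclusive end position

data.frame(tokens, start, end)
```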
Answer 3: (score: 0)
The function unnest_tokens() has a strip_punct argument that is passed on to the tokenizer, for example the word tokenizer.
library(tidyverse)
library(tidytext)
df %>%
unnest_tokens(word, Section, strip_punct = FALSE)
#> # A tibble: 10 x 1
#> word
#> <chr>
#> 1 if
#> 2 an
#> 3 infusion
#> 4 reaction
#> 5 occurs
#> 6 ,
#> 7 interrupt
#> 8 the
#> 9 infusion
#> 10 .
Created on 2018-08-15 by the reprex package (v0.2.0).
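One thing to notice in the output above is that "If" came back as "if": unnest_tokens() lowercases tokens by default. If the original case matters, its to_lower argument can be switched off. A minimal sketch, assuming tidytext is installed (the `out` variable is just for illustration):

```r
library(tidytext)

Section <- c("If an infusion reaction occurs, interrupt the infusion.")
df <- data.frame(Section, stringsAsFactors = FALSE)

# strip_punct = FALSE keeps punctuation as tokens;
# to_lower = FALSE leaves the original capitalisation untouched
out <- unnest_tokens(df, word, Section, strip_punct = FALSE, to_lower = FALSE)
out$word
```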