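I am doing some NLP in R and using the stringr package to tokenize some text.
I would like to be able to capture contractions, for example won't, so that it is tokenized as "wo" and "n't".
Here is an example of what I have so far: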
library(stringr)
s = "won't you buy my raspberries?"
foo = str_extract_all(s, "(n't)|[[:punct:]]" ) # captures the contraction OK...
foo[[1]]
>[1] "n't" "?"
foo = str_extract_all(s, "(n't)|\\w+|[[:punct:]]" ) # gets all words,
# but splits the contraction!
foo[[1]]
>[1] "won" "'" "t" "you" "buy" "my" "raspberries" "?"
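I am trying to tokenize the sentence above as "wo", "n't", "you", "buy", "my", "raspberries", "?".
I am not sure whether I can do this with the default, extended regular expressions, or whether I need to find some way to use Perl-like patterns.
Does anyone know of a way to do the tokenization described above with the stringr package?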
Edit: To clarify, I am interested in Treebank tokenization.
Answer 0: (score: 2)
You can do this with a lookahead, which the PCRE library supports.
> s = "won't you buy my raspberries?"
> s
[1] "won't you buy my raspberries?"
> m <- gregexpr("\\w+(?=n[[:punct:]]t)|n?[[:punct:]]t?|\\w+", s, perl=TRUE)
> regmatches(s, m)
[[1]]
[1] "wo" "n't" "you" "buy" "my"
[6] "raspberries" "?"
OR
> m <- gregexpr("\\w+(?=\\w[[:punct:]]\\w)|\\w?[[:punct:]]\\w?|\\w+", s, perl=TRUE)
> regmatches(s, m)
[[1]]
[1] "wo" "n't" "you" "buy" "my"
[6] "raspberries" "?"
OR
Using the stringr library:
> s <- "won't you buy my raspberries?"
> str_extract_all(s, perl("\\w+(?=\\w[[:punct:]]\\w)|\\w?[[:punct:]]\\w?|\\w+") )[[1]]
[1] "wo" "n't" "you" "buy" "my"
[6] "raspberries" "?"
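As far as I know, perl() was deprecated in stringr 1.0.0, where patterns are handled by the ICU regex engine, which supports lookahead directly. A minimal sketch of the same extraction, assuming a stringr version of 1.0 or later and using regex() in place of perl() (the variable names pattern and tokens are only for illustration):

```r
library(stringr)

s <- "won't you buy my raspberries?"

# Same pattern as above. Alternatives are tried left to right:
#   1. word characters followed by <word char><punct><word char> (the "wo" of "won't")
#   2. an optional word char, a punctuation mark, an optional word char ("n't", "?")
#   3. any other run of word characters
pattern <- "\\w+(?=\\w[[:punct:]]\\w)|\\w?[[:punct:]]\\w?|\\w+"

# Assumption: stringr >= 1.0, where regex() is the explicit regex modifier.
tokens <- str_extract_all(s, regex(pattern))[[1]]
print(tokens)
# [1] "wo"          "n't"         "you"         "buy"         "my"
# [6] "raspberries" "?"
```

Note that the second alternative attaches any single word character on either side of a punctuation mark, so this handles "n't" but is not a full Treebank tokenizer.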
Answer 1: (score: 2)
You can try the perl wrapper function when using stringr package functions.
s <- "won't you buy my raspberries?"
pattern <- "(?=[a-z]'[a-z])|(\\s+)|(?=[!?.])"
library(stringr)
str_split(s, perl(pattern))[[1]]
# [1] "wo" "n't" "you" "buy" "my"
# [6] "raspberries" "?"
There are other wrappers as well, such as fixed and ignore.case.
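For reference, here is a small sketch of how those wrappers look in newer stringr releases. This assumes stringr 1.0 or later, where, to my knowledge, ignore.case(x) was deprecated in favour of regex(x, ignore_case = TRUE), while fixed() is still available. The example string and the variable names has_qmark and has_wont are made up for illustration:

```r
library(stringr)

s <- "Won't you buy my raspberries?"

# fixed() matches the pattern literally, so "?" needs no escaping.
has_qmark <- str_detect(s, fixed("?"))

# Case-insensitive regex matching; assumed replacement for ignore.case().
has_wont <- str_detect(s, regex("won't", ignore_case = TRUE))

print(c(has_qmark, has_wont))
# [1] TRUE TRUE
```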