R stringr and str_extract_all: capturing contractions

Time: 2014-09-29 07:48:11

Tags: regex r tokenize

I'm doing some NLP in R and using the stringr package to tokenize some text.

I'd like to be able to capture contractions, e.g. won't, so that it is tokenized into "wo" and "n't".

Here is an example of what I've got so far:

library(stringr)

s = "won't you buy my raspberries?"

foo = str_extract_all(s, "(n't)|[[:punct:]]" )          # captures the contraction OK...
foo[[1]]
>[1] "n't" "?"  

foo = str_extract_all(s, "(n't)|\\w+|[[:punct:]]" )       # gets all words, 
                                                  # but splits the contraction! 
foo[[1]]
>[1] "won"  "'"  "t"  "you"  "buy"  "my"  "raspberries"  "?"  

I'm trying to tokenize the above sentence into "wo" "n't" "you" "buy" "my" "raspberries" "?"

I'm not sure whether I can do this with the default, extended regular expressions, or whether I need to find some way to use Perl-like patterns.

Does anyone know of a way to tokenize as described above using the stringr package?

Edit: To clarify, I'm interested in Treebank tokenization.

2 Answers:

Answer 0: (score: 2)

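You can do this with a lookahead, which the PCRE library supports. The first alternative matches the part of the word that is immediately followed by the contraction, so the contraction itself is left over for the second alternative.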

> s = "won't you buy my raspberries?"
> s
[1] "won't you buy my raspberries?"
> m <- gregexpr("\\w+(?=n[[:punct:]]t)|n?[[:punct:]]t?|\\w+", s, perl=TRUE)
> regmatches(s, m)
[[1]]
[1] "wo"          "n't"         "you"         "buy"         "my"         
[6] "raspberries" "?" 

OR

> m <- gregexpr("\\w+(?=\\w[[:punct:]]\\w)|\\w?[[:punct:]]\\w?|\\w+", s, perl=TRUE)
> regmatches(s, m)
[[1]]
[1] "wo"          "n't"         "you"         "buy"         "my"         
[6] "raspberries" "?" 

OR

Via the stringr library:

> s <- "won't you buy my raspberries?"
> str_extract_all(s, perl("\\w+(?=\\w[[:punct:]]\\w)|\\w?[[:punct:]]\\w?|\\w+") )[[1]]
[1] "wo"          "n't"         "you"         "buy"         "my"         
[6] "raspberries" "?"  
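
The lookahead idea above can be wrapped in a small reusable function. This is only a minimal sketch in base R: the name `tokenize_nt` is hypothetical, the pattern is a narrowed variant of the one in the answer (it handles only "n't" contractions, not the full set of Treebank rules), and it assumes a single input string.

```r
# Hypothetical helper (not from the original answer): tokenize one string,
# splitting "n't" contractions into a stem and the "n't" clitic.
tokenize_nt <- function(x) {
  # Alternatives, tried left to right at each position:
  #   \\w+(?=n't)   a word stem directly followed by "n't" (lookahead, PCRE)
  #   n't           the contraction itself
  #   \\w+          any other run of word characters
  #   [[:punct:]]   a single punctuation character
  pattern <- "\\w+(?=n't)|n't|\\w+|[[:punct:]]"
  m <- gregexpr(pattern, x, perl = TRUE)  # perl = TRUE enables lookahead
  regmatches(x, m)[[1]]
}

tokens <- tokenize_nt("won't you buy my raspberries?")
print(tokens)
# [1] "wo"          "n't"         "you"         "buy"         "my"
# [6] "raspberries" "?"
```

The order of the alternatives matters: the lookahead branch has to come before the plain `\\w+` branch, otherwise `\\w+` consumes "won" first and the contraction is split, as in the question.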

Answer 1: (score: 2)

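You can try the perl wrapper function when using stringr package functions. Here the string is split at zero-width lookahead positions rather than extracted token by token.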

s <- "won't you buy my raspberries?"
pattern <- "(?=[a-z]'[a-z])|(\\s+)|(?=[!?.])"
library(stringr)
str_split(s, perl(pattern))[[1]]
# [1] "wo"          "n't"         "you"         "buy"         "my"         
# [6] "raspberries" "?" 

There are other wrappers as well, such as fixed and ignore.case.
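
Both answers rely on the `perl()` wrapper from the stringr of that time. As far as I know, stringr 1.0.0 and later is built on stringi (ICU regular expressions), which supports lookahead directly, and `perl()` and `ignore.case()` were deprecated in favour of `regex()`. The following is a sketch under that assumption, using a narrowed "n't" pattern rather than the exact patterns from the answers:

```r
library(stringr)

s <- "won't you buy my raspberries?"

# regex() is the default pattern type in newer stringr; it is spelled out here
# only to make the engine explicit. The lookahead branch must come first so the
# stem "wo" is matched before \\w+ can consume "won".
pattern <- regex("\\w+(?=n't)|n't|\\w+|[[:punct:]]")

tokens <- str_extract_all(s, pattern)[[1]]
print(tokens)
```

If case-insensitive matching is needed, `regex(pattern, ignore_case = TRUE)` plays the role of the older `ignore.case()` wrapper.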