Question

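I'm doing some NLP in R and using the stringr package to tokenize some text.

I'd like to be able to capture contractions, for example "won't", so that it is tokenized into "wo" and "n't".

Here is an example of what I've got so far: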

library(stringr)

s = "won't you buy my raspberries?"

foo = str_extract_all(s, "(n't)|[[:punct:]]" )          # captures the contraction OK...
foo[[1]]
>[1] "n't" "?"  

foo = str_extract_all(s, "(n't)|\\w+|[[:punct:]]" )       # gets all words,
                                                  # but splits the contraction!
foo[[1]]
>[1] "won"  "'"  "t"  "you"  "buy"  "my"  "raspberries"  "?"

I'm trying to tokenize the sentence above into "wo", "n't", "you", "buy", "my", "raspberries", "?".

I'm not sure whether I can do this with the default, extended regular expressions, or whether I need to find some way to use Perl-like patterns.

Does anyone know of a way to tokenize as described above using the stringr package?

Edit: to clarify, I'm interested in Treebank tokenization.

Answer 1

You can do this with a lookahead, which the PCRE library supports.

> s = "won't you buy my raspberries?"
> s
[1] "won't you buy my raspberries?"
> m <- gregexpr("\\w+(?=n[[:punct:]]t)|n?[[:punct:]]t?|\\w+", s, perl=TRUE)
> regmatches(s, m)
[[1]]
[1] "wo"          "n't"         "you"         "buy"         "my"         
[6] "raspberries" "?"

OR

> m <- gregexpr("\\w+(?=\\w[[:punct:]]\\w)|\\w?[[:punct:]]\\w?|\\w+", s, perl=TRUE)
> regmatches(s, m)
[[1]]
[1] "wo"          "n't"         "you"         "buy"         "my"         
[6] "raspberries" "?"

OR

through the stringr library,

> s <- "won't you buy my raspberries?"
> str_extract_all(s, perl("\\w+(?=\\w[[:punct:]]\\w)|\\w?[[:punct:]]\\w?|\\w+") )[[1]]
[1] "wo"          "n't"         "you"         "buy"         "my"         
[6] "raspberries" "?"
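
To make the lookahead idea easier to reuse, here is a minimal base-R sketch. The helper name `tokenize_nt` and its stricter pattern (a literal `n't` instead of any character, punctuation, character triple) are assumptions for illustration, not part of the answer above. The first alternative matches a word only up to the point where `n't` follows, the second consumes the `n't` itself, and the remaining alternatives pick up plain words and single punctuation marks.

```r
# Hypothetical helper (not from the original answer): split "n't" off the
# preceding word, keep other words whole, and emit punctuation separately.
tokenize_nt <- function(x) {
  # Alternatives are tried left to right at each position:
  #   \\w+(?=n't)  word characters that are immediately followed by "n't"
  #   n't          the contraction suffix itself
  #   \\w+         any other run of word characters
  #   [[:punct:]]  a single punctuation character
  pattern <- "\\w+(?=n't)|n't|\\w+|[[:punct:]]"
  # perl = TRUE is needed because lookahead is a PCRE feature
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

tokenize_nt("won't you buy my raspberries?")[[1]]
# [1] "wo"          "n't"         "you"         "buy"         "my"
# [6] "raspberries" "?"
```

This sketch only handles the `n't` case. Other Treebank-style clitics such as `'s` or `'ll` would need their own alternatives; with the pattern as written, "it's" comes out as "it", "'", "s".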

Answer 2

When using the stringr package's functions, you can try its perl wrapper function.

s <- "won't you buy my raspberries?"
pattern <- "(?=[a-z]'[a-z])|(\\s+)|(?=[!?.])"
library(stringr)
str_split(s, perl(pattern))[[1]]
# [1] "wo"          "n't"         "you"         "buy"         "my"         
# [6] "raspberries" "?"

There are other wrappers as well, such as fixed and ignore.case.
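
As far as I know, stringr 1.0.0 and later deprecate the `perl()` and `ignore.case()` wrappers in favour of `regex()`, with `fixed()` still available, so the snippets above may need adjusting on a current installation. The following is a sketch under that assumption. It reuses a stricter `n't` pattern that is my own choice rather than the pattern from either answer, and relies on the ICU regex engine used by newer stringr supporting lookahead.

```r
library(stringr)

s <- "won't you buy my raspberries?"

# regex() is the default pattern type in newer stringr, so wrapping the
# pattern in it is optional; it is written out here to mirror perl().
tokens <- str_extract_all(s, regex("\\w+(?=n't)|n't|\\w+|[[:punct:]]"))[[1]]
tokens

# Case-insensitive matching goes through an argument of regex()
# rather than a separate ignore.case() wrapper.
str_detect("Won't", regex("won't", ignore_case = TRUE))

# fixed() treats the pattern as a literal string, so "." is just a dot.
str_detect("a.b", fixed("."))
```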

R stringr and str_extract_all: capturing contractions
