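quanteda kwic regex operation

Time: 2018-03-25 17:25:27

Tags: r regex nlp quanteda

Further edit to the original question: the problem stems from expecting the regular expression to behave exactly like, or close to, "grep" or some programming languages. Below is what I expected; the fact that it did not happen is what prompted my question (using cygwin):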

echo "regex unusual operation will deport into a different" > out.txt
grep "will * dep" out.txt
"regex unusual operation will deport into a different"

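Original question
I was trying to follow https://github.com/kbenoit/ITAUR/blob/master/README.md
to learn quanteda, after seeing that everyone who uses the package finds it very good. At line 22 of demo.R, I found this line: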

kwic(immigCorpus, "deport", window = 3)  

Its output is:

[BNP, 157]        The BNP will | deport | all foreigners convicted  
[BNP, 1946]                . 2. | Deport | all illegal immigrants    
[BNP, 1952] immigrants We shall | deport | all illegal immigrants  
[BNP, 2585]  Criminals We shall | deport | all criminal entrants  

Trying to learn the basics, I ran

kwic(immigCorpus, "will *depo", window = 3, valuetype = "regex")

expecting to get

[BNP, 157]        The BNP will | deport | all foreigners convicted

But I got:

kwic object with 0 rows

Similar attempts, such as

kwic(immigCorpus, ".*will *depo.*", window = 3, valuetype = "regex")

give the same result:

kwic object with 0 rows

Why? Tokenization? If so, how should I write the regex?

PS: Thanks for this wonderful package.

2 answers:

Answer 0: (score: 0)

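The examples in the ITAUR repository are based on an older syntax. What you need is the phrase() wrapper; see ?phrase. You should also review your use of * in regular expression syntax: it probably does not do what you want, and a regular expression cannot begin with "*". (This may help.) The default "glob" valuetype will probably do what you want.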
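To make the two meanings of * concrete, here is a minimal base R sketch. It assumes that utils::glob2rx() is a fair stand-in for how a glob pattern maps onto a regex; quanteda's own conversion may differ in detail.

```r
# In a glob, "*" stands for any run of characters; in a regex, "*" repeats
# whatever precedes it zero or more times.
rx <- utils::glob2rx("depo*")   # translate the glob into an anchored regex

glob_hit  <- grepl(rx, "deport")       # token starts with "depo"
glob_miss <- grepl(rx, "undeported")   # translated pattern is anchored at the start

# In the regex "will *depo", the "*" applies only to the space before it.
zero_spaces  <- grepl("will *depo", "willdepo")       # zero spaces are allowed
many_spaces  <- grepl("will *depo", "will   depo")    # so are several spaces
other_chars  <- grepl("will *depo", "will not depo")  # " *" cannot span other characters

print(c(glob_hit, glob_miss, zero_spaces, many_spaces, other_chars))
```

So "will *depo" as a regex means "will", then any number of spaces, then "depo", which is not the same as the glob reading "will", then anything, then "depo".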

library("quanteda")
## Package version: 1.1.4
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.

kwic(data_char_ukimmig2010, phrase("will deport"))

## [BNP, 156:157] nation.- The BNP | will deport | all foreigners convicted of crimes

kwic(data_char_ukimmig2010, phrase("will .*deport.*"), valuetype = "regex")

## [BNP, 156:157] nation.- The BNP | will deport | all foreigners convicted of crimes
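Why the single-string regex in the question finds nothing can be illustrated without quanteda. The sketch below assumes that a pattern not wrapped in phrase() is compared against one token at a time, and it uses strsplit() as a crude stand-in for the real tokenizer; the sentence is taken from the output above.

```r
# Crude whitespace tokenization (an assumption; quanteda's tokenizer is richer).
txt  <- "The BNP will deport all foreigners convicted of crimes"
toks <- strsplit(txt, " ", fixed = TRUE)[[1]]

# Against the whole string the pattern matches, as it does with grep.
whole_string_match <- grepl("will *depo", txt)

# Against each token separately nothing matches, because no single token
# contains both "will" and "depo".
per_token_match <- grepl("will *depo", toks)

print(whole_string_match)
print(any(per_token_match))
```

This is the situation phrase() addresses: it splits the pattern on whitespace so that each piece is matched against its own token.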

Answer 1: (score: 0)

You are trying to match a phrase with your pattern. By default, the pattern argument is treated as a space separated list of keywords, and the search is performed against this list. So, you may get your expected result using

> kwic(immigCorpus, phrase("will deport"), window = 3)
[BNP, 156:157] - The BNP | will deport | all foreigners convicted

A valuetype = "regex" makes sense if you are using a regex. E.g. to get both shall and will deport use

> kwic(immigCorpus, phrase("(will|shall) deport"), window = 3, valuetype = "regex")

   [BNP, 156:157]             - The BNP | will deport  | all foreigners convicted
 [BNP, 1951:1952] illegal immigrants We | shall deport | all illegal immigrants  
 [BNP, 2584:2585]  Foreign Criminals We | shall deport | all criminal entrants 
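As a rough illustration of what a multi-token regex match involves, here is a base R sketch. The helper match_phrase() is hypothetical and is not part of quanteda; it only assumes that the pattern is split on whitespace and that each piece is tested against consecutive tokens. The token vector is made up for the example.

```r
# Hypothetical helper (not a quanteda function): return the start positions
# where consecutive tokens match consecutive regexes.
match_phrase <- function(toks, patterns, ignore_case = TRUE) {
  k <- length(patterns)
  n <- length(toks)
  if (n < k) return(integer(0))
  starts <- seq_len(n - k + 1)
  ok <- rep(TRUE, length(starts))
  for (j in seq_len(k)) {
    # Test the j-th regex against the token j - 1 places after each start.
    ok <- ok & grepl(patterns[j], toks[starts + j - 1], ignore.case = ignore_case)
  }
  starts[ok]
}

# Made-up tokens for illustration only.
toks <- c("The", "BNP", "will", "deport", "all", "foreigners", ".",
          "We", "shall", "deport", "all", "illegal", "immigrants")

# Split the phrase on whitespace: one regex per token position.
patterns <- strsplit("(will|shall) deport", " ", fixed = TRUE)[[1]]
hits <- match_phrase(toks, patterns)

# Print each match with its token span.
for (h in hits) {
  cat(sprintf("[%d:%d] %s\n", h, h + 1L, paste(toks[h:(h + 1L)], collapse = " ")))
}
```

The alternation (will|shall) applies to the first token only, and "deport" is tested against the token that follows it.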

See this kwic documentation.