How can I make a wildcard match a space?

Date: 2018-04-22 14:18:59

Tags: r wildcard text-mining quanteda

Suppose I have this sentence:

text<-("I want to find both the greatest cake of the world but also some very great cakes but I want to find this last part : isn't it")

When I write this (kwic is a quanteda function):

kwic(text,phrase("great* cake*"))

I get

[text1, 7:8]    want to find both the | greatest cake | of the world but also
[text1, 16:17] world but also some very |  great cakes  | but I want to find  

However, when I do this

 kwic(text,phrase("great*cake*"))

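I get a kwic object with 0 rows, i.e. nothing.

I would like to know what exactly * stands in for and, more importantly, how to make the wildcard take spaces into account.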

1 Answer:

Answer 0: (score: 1)

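To answer what * matches, you need to understand the "glob" valuetype, which you can read about with ?valuetype. In short, * matches any number of any characters, including none. Note that this is very different from its use in a regular expression, where it means "match zero or more of the preceding character".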
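The difference between the two meanings of * can be seen in a small base-R sketch. It uses glob2rx() from the utils package to translate a glob into a regular expression; this only approximates what quanteda does internally, and the word list is made up for illustration.

```r
# Hypothetical word list, chosen only to show the difference.
words <- c("grea", "great", "greatest", "cake")

# Glob: "great*" means "great" followed by anything (or nothing).
# glob2rx() translates the glob into an equivalent regular expression.
glob_rx <- glob2rx("great*")
glob_hits <- words[grepl(glob_rx, words)]

# Regex: in "^great*$" the * applies only to the preceding "t",
# so the pattern means "grea" followed by zero or more "t".
regex_hits <- words[grepl("^great*$", words)]

print(glob_hits)   # "great" "greatest"
print(regex_hits)  # "grea" "great"
```

Read as a glob, the pattern keeps "greatest"; read as a regular expression, the same characters reject it and accept "grea" instead.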

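The pattern argument of kwic() is matched one pattern per token, after the text has been tokenized. Even when wrapped in the phrase() function, it still only considers sequences of matches against tokens. You therefore cannot match the whitespace (which defines the boundaries between tokens) unless you actually include it inside the values of the tokens themselves.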
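A rough base-R sketch of this token-by-token behaviour follows. It assumes a plain split on single spaces, which is much simpler than quanteda's real tokenizer, and it uses glob2rx() as a stand-in for quanteda's glob matching; the shortened sentence and the variable names are illustrative only.

```r
# Shortened version of the sentence from the question.
txt <- "I want to find both the greatest cake of the world but also some very great cakes"

# Simplified tokenizer (an assumption): split on single spaces.
toks <- strsplit(txt, " ", fixed = TRUE)[[1]]

# The glob "great*cake*" as a regular expression.
rx <- glob2rx("great*cake*")

# Matched against single tokens, nothing qualifies: no token contains both words.
single_hits <- toks[grepl(rx, toks)]

# Pasting adjacent tokens puts the space inside the value being matched.
bigrams <- paste(head(toks, -1), tail(toks, -1))
bigram_hits <- bigrams[grepl(rx, bigrams)]

print(single_hits)  # character(0)
print(bigram_hits)  # "greatest cake" "great cakes"
```

The pattern only succeeds once the space is part of the string it is compared against.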

How could you do that? Like this:

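# Build bigram tokens whose values contain the space
# (tokens_ngrams() replaces the older tokens(text, ngrams = 2, ...) call)
toksbi <- tokens_ngrams(tokens(text), n = 2, concatenator = " ")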
# tokens from 1 document.
# text1 :
#  [1] "I want"        "want to"       "to find"       "find both"     "both the"     
#  [6] "the greatest"  "greatest cake" "cake of"       "of the"        "the world"    
# [11] "world but"     "but also"      "also some"     "some very"     "very great"   
# [16] "great cakes"   "cakes but"     "but I"         "I want"        "want to"      
# [21] "to find"       "find this"     "this last"     "last part"     "part :"       
# [26] ": isn't"       "isn't it"   

kwic(toksbi, "great*cake*", window = 2)

#  [text1, 7] both the the greatest | greatest cake | cake of of the 
# [text1, 16]  some very very great |  great cakes  | cakes but but I

But your original usage, kwic(text, phrase("great* cake*")), is the recommended one.
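For readers on a recent quanteda release, here is a sketch of the same phrase() approach with explicit tokenization. It assumes quanteda is installed and that, as in version 3 and later to my knowledge, kwic() is meant to be called on a tokens object rather than on a character vector. The names toks and hits are illustrative.

```r
library(quanteda)

text <- "I want to find both the greatest cake of the world but also some very great cakes but I want to find this last part : isn't it"

# Tokenize first, then search for the two-token sequence.
toks <- tokens(text)

# phrase() splits the pattern on whitespace, so "great*" and "cake*"
# are each matched against one token of the sequence.
hits <- kwic(toks, pattern = phrase("great* cake*"), window = 5)

print(hits)
```

This keeps the tokens as single words, so other patterns and window sizes still behave as expected.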