Question

Suppose I have this sentence:

text<-("I want to find both the greatest cake of the world but also some very great cakes but I want to find this last part : isn't it")

When I run this (kwic is a quanteda function):

kwic(text,phrase("great* cake*"))

I get

[text1, 7:8]    want to find both the | greatest cake | of the world but also
[text1, 16:17] world but also some very |  great cakes  | but I want to find

However, when I do this

 kwic(text,phrase("great*cake*"))

I get a kwic object with 0 rows, i.e. nothing.

I would like to know what exactly * stands in for and, more importantly, how to "take into account" whitespace in a wildcard.

Answer 1

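To answer what * matches, you need to understand the "glob" valuetype, which you can read about using ?valuetype. In short, * matches any number of any characters, including none. Note that this is very different from its use in a regular expression, where it means "match zero or more of the preceding character".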
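The difference between the two meanings of * can be checked in base R. This is only a sketch: it uses utils::glob2rx() to translate a glob into a regular expression, which illustrates glob semantics in general and is not a claim about how quanteda implements them. The vector `words` is made up for the illustration.

```r
# Hypothetical test words, not taken from the question.
words <- c("great", "greatest", "grea", "greattt")

# Glob: "great" followed by any number of any characters (including none).
glob_hits <- grepl(glob2rx("great*"), words)

# Regex: "grea" followed by zero or more "t" characters, and nothing else.
regex_hits <- grepl("^great*$", words)

data.frame(words, glob_hits, regex_hits)
```

Under the glob reading "greatest" matches and "grea" does not; under the regex reading it is the other way round.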

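The pattern argument of kwic() is matched one pattern per token, after the text has been tokenized. Even when wrapped in the phrase() function, it still only considers sequences of matches against tokens. You therefore cannot match the whitespace (which defines the boundaries between tokens) unless you actually include it inside the value of the tokens themselves.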
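The token-by-token behaviour can be imitated in base R. The following is a rough simulation under stated assumptions, not quanteda's actual code: tokens are obtained by splitting on single spaces, and glob2rx() stands in for glob matching. The sentence is a shortened version of the one in the question.

```r
text <- "I want to find both the greatest cake of the world but also some very great cakes"
toks <- strsplit(text, " ", fixed = TRUE)[[1]]

# One pattern against single tokens: no token contains both "great" and "cake",
# so "great*cake*" finds nothing.
single <- grepl(glob2rx("great*cake*"), toks)
any(single)

# Two patterns against adjacent tokens, similar in spirit to phrase("great* cake*"):
# token i must match "great*" and token i + 1 must match "cake*".
first  <- grepl(glob2rx("great*"), toks)
second <- grepl(glob2rx("cake*"), toks)
starts <- which(first[-length(toks)] & second[-1])
starts
```

The start positions found this way (7 and 16) line up with the 7:8 and 16:17 spans reported by kwic() in the question.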

How could you do that? Like this:

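# Build bigram tokens joined by a space, so the whitespace is part of each token.
# (If your quanteda version no longer accepts `ngrams` in tokens(), try
#  tokens_ngrams(tokens(text), n = 2, concatenator = " ") instead.)
toksbi <- tokens(text, ngrams = 2, concatenator = " ")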
# tokens from 1 document.
# text1 :
#  [1] "I want"        "want to"       "to find"       "find both"     "both the"     
#  [6] "the greatest"  "greatest cake" "cake of"       "of the"        "the world"    
# [11] "world but"     "but also"      "also some"     "some very"     "very great"   
# [16] "great cakes"   "cakes but"     "but I"         "I want"        "want to"      
# [21] "to find"       "find this"     "this last"     "last part"     "part :"       
# [26] ": isn't"       "isn't it"   

kwic(toksbi, "great*cake*", window = 2)

#  [text1, 7] both the the greatest | greatest cake | cake of of the 
# [text1, 16]  some very very great |  great cakes  | cakes but but I
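Why the bigram approach works can be seen without quanteda. As a minimal sketch, assuming simple space-separated tokens and using glob2rx() as a stand-in for glob matching, pasting neighbouring tokens together puts the space inside each token, so a single glob can span it:

```r
text <- "I want to find both the greatest cake of the world but also some very great cakes"
toks <- strsplit(text, " ", fixed = TRUE)[[1]]

# Join each token with its right-hand neighbour, separated by a space.
bigrams <- paste(toks[-length(toks)], toks[-1])

# The * between "great" and "cake" can now absorb "est " or " ".
hits <- grep(glob2rx("great*cake*"), bigrams)
bigrams[hits]
```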

However, the original usage, kwic(text, phrase("great* cake*")), is the recommended one.
