Suppose I have this sentence:
text<-("I want to find both the greatest cake of the world but also some very great cakes but I want to find this last part : isn't it")
When I run this (kwic is a quanteda function):
kwic(text,phrase("great* cake*"))
I get
[text1, 7:8] want to find both the | greatest cake | of the world but also
[text1, 16:17] world but also some very | great cakes | but I want to find
But when I do this
kwic(text,phrase("great*cake*"))
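I get a kwic object with 0 rows, i.e. nothing.
I would like to know what * actually stands for and, more importantly, how to make a wildcard "take the space into account".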
Answer 0 (score: 1)
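To answer what * matches, you need to understand the "glob" valuetype, which you can read about using ?valuetype. In short, * matches any number of any characters, including none. Note that this is very different from its use in regular expressions, where it means "match zero or more of the preceding character".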
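To make the glob semantics concrete, here is a minimal base R sketch. It does not call quanteda: it uses utils::glob2rx() to turn each glob into a regular expression and tests it against a few single tokens. The vector toks_example and the result names are made up for illustration, and quanteda's own matching (which is case-insensitive by default) may differ in detail.

```r
# Illustrative tokens (hypothetical, not produced by quanteda)
toks_example <- c("greatest", "cake", "great", "cakes", "greatcake", "great-cakes")

# glob2rx() converts a glob such as "great*" into an anchored regular expression
hits_great  <- grepl(utils::glob2rx("great*"), toks_example)
hits_cake   <- grepl(utils::glob2rx("cake*"), toks_example)
hits_joined <- grepl(utils::glob2rx("great*cake*"), toks_example)

# Tokens matched by each glob
toks_example[hits_great]   # tokens starting with "great"
toks_example[hits_cake]    # tokens starting with "cake"
toks_example[hits_joined]  # tokens starting with "great" that also contain "cake"
```

Under these assumptions, "great*cake*" can only match a single token that contains both pieces, such as "greatcake". No token in the example sentence has that shape.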
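The pattern argument of kwic() is matched one pattern per token, after the text has been tokenized. Even when wrapped in the phrase() function, it still only considers sequences of matches against tokens. So you cannot match the whitespace (which defines the boundaries between tokens) unless you actually include it inside the token values themselves.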
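The token-by-token behaviour can be sketched in base R. This is only an approximation of the idea, not quanteda's implementation: the text is split on single spaces with strsplit() rather than with tokens(), and match_phrase() is a hypothetical helper written for this illustration.

```r
text <- "I want to find both the greatest cake of the world but also some very great cakes"

# Crude tokenization on single spaces (an assumption; quanteda's tokenizer is richer)
toks <- strsplit(text, " ", fixed = TRUE)[[1]]

# Hypothetical helper: start positions where consecutive tokens
# match the consecutive glob patterns in `globs`
match_phrase <- function(toks, globs) {
  n <- length(globs)
  starts <- seq_len(max(length(toks) - n + 1, 0))
  hit <- vapply(starts, function(i) {
    window <- toks[i:(i + n - 1)]
    all(mapply(function(g, tok) grepl(utils::glob2rx(g), tok), globs, window))
  }, logical(1))
  starts[hit]
}

two_token <- match_phrase(toks, c("great*", "cake*"))  # like phrase("great* cake*")
one_token <- match_phrase(toks, "great*cake*")         # like phrase("great*cake*")
```

Here two_token holds positions 7 and 16, the same start positions kwic() reports, while one_token is empty because no single token begins with "great" and also contains "cake".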
How could you include the whitespace in the tokens? Like this:
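# Build bigram tokens joined by a space, so each token contains whitespace.
# Note (assumption): newer quanteda releases are understood to have dropped the
# ngrams argument of tokens(); tokens_ngrams(tokens(text), n = 2, concatenator = " ")
# should be the equivalent there.
toksbi <- tokens(text, ngrams = 2, concatenator = " ")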
# tokens from 1 document.
# text1 :
# [1] "I want" "want to" "to find" "find both" "both the"
# [6] "the greatest" "greatest cake" "cake of" "of the" "the world"
# [11] "world but" "but also" "also some" "some very" "very great"
# [16] "great cakes" "cakes but" "but I" "I want" "want to"
# [21] "to find" "find this" "this last" "last part" "part :"
# [26] ": isn't" "isn't it"
kwic(toksbi, "great*cake*", window = 2)
# [text1, 7] both the the greatest | greatest cake | cake of of the
# [text1, 16] some very very great | great cakes | cakes but but I
But the original usage, kwic(text, phrase("great* cake*")), is the recommended one.