Quanteda: Creating ngrams and skipgrams from tokens in R

Time: 2018-08-25 07:09:54

Tags: r n-gram quanteda

I have been exploring the quanteda package in R and cannot quite figure out what tokens_skipgrams does. Below is the example from the manual of this package, and I am not sure I understand it correctly:

tokens_skipgrams(toks, n = 3, skip = 0:2, concatenator = " ")   
tokens from 1 document.
text1 :
[1] "insurgents killed in"        "insurgents killed ongoing"  
[3] "insurgents killed fighting"  "insurgents in ongoing"      
[5] "insurgents in fighting"      "insurgents ongoing fighting"
[7] "killed in ongoing"           "killed in fighting"         
[9] "killed ongoing fighting"     "in ongoing fighting"        

I expected the output to contain only the following:

 "insurgents killed in"    "killed in ongoing"    "in ongoing fighting" 
 "insurgents in fighting"

Why does the result also include:

  "insurgents killed ongoing"  
  "insurgents killed fighting"  
  "insurgents in ongoing"      
  "insurgents ongoing fighting"
  "killed in fighting"         
  "killed ongoing fighting" 

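In the example above, skip = 0:2, i.e. skip takes the values 0, 1, and 2. I therefore assumed the command could safely be split into three separate calls, one per value, and that combining their results would reproduce the output above.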

tokens_skipgrams(toks, n = 3, skip = 0, concatenator = " ")   
tokens from 1 document.
text1 :
[1] "insurgents killed in" "killed in ongoing"    "in ongoing fighting" 

tokens_skipgrams(toks, n = 3, skip = 1, concatenator = " ")   
tokens from 1 document.
text1 :
[1] "insurgents in fighting"


tokens_skipgrams(toks, n = 3, skip = 2, concatenator = " ")   
tokens from 1 document.
text1 :
character(0)

But combining these three results gives me exactly the four skip-grams I expected, not the ten that skip = 0:2 returns.

Can anyone clear this up for me?

1 answer:

Answer 0: (score: 2)

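The behaviour you are observing is the implementation of the definition of a skip-gram by Guthrie et al. (2006): "A k skip-gram is an ngram which is a superset of all ngrams and each (k-i) skip-gram until (k-i)==0 (which includes 0 skip-grams)." (This is cited in the quanteda manual page for ?tokens_skipgrams. The original source is Guthrie, D., B. Allison, W. Liu, and L. Guthrie. 2006. "A Closer Look at Skip-Gram Modelling.") The s02 example below is taken directly from that paper, where it illustrates what the authors call "2-skip-tri-grams".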

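For a scalar value of skip, however, this recursive inclusion of smaller skips is not applied, so as to give the user maximum control: skip = 2 means exactly two skipped tokens in every gap, not "up to two".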
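To make the scalar-versus-vector distinction concrete, here is a minimal base R sketch (not quanteda code) that only counts the possible gap patterns for a trigram. It assumes, consistent with the outputs shown in this thread, that each of the n - 1 gaps between consecutive words may independently take any value in skip; the function name gap_patterns is hypothetical.

```r
# Hypothetical helper: enumerate every way of assigning a skip value
# to each of the (n - 1) gaps between consecutive words of an n-gram.
gap_patterns <- function(skip, n = 3) {
  grid <- expand.grid(rep(list(skip), n - 1))
  names(grid) <- paste0("gap", seq_len(n - 1))
  grid
}

# A vector of skips lets the two gaps differ, e.g. gap1 = 0 and gap2 = 2.
vector_patterns <- gap_patterns(0:2)

# Three scalar calls only ever produce equal gaps: (0,0), (1,1), (2,2).
scalar_patterns <- do.call(rbind, lapply(0:2, gap_patterns))

nrow(vector_patterns)  # 9 patterns
nrow(scalar_patterns)  # 3 patterns
```

The six mixed patterns that appear only in vector_patterns are the reason that three scalar calls cannot be combined into the skip = 0:2 result. Not every pattern fits inside a five-token text, so the number of skip-grams actually produced is smaller than the number of patterns.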

This explains the difference between supplying skip as a single scalar, as you did above, and supplying the sequence 0:2. For

toks <- tokens("insurgents killed in ongoing fighting")
toks
# tokens from 1 document.
# text1 :
# [1] "insurgents" "killed"     "in"         "ongoing"    "fighting" 

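we observe combinations such as "insurgents killed fighting" when skip = 0:2, because it contains a skip of 0 (between "insurgents" and "killed") and a skip of 2 (between "killed" and "fighting"). For the phrase here, this means that going from skip = 0:1 to skip = 0:2 adds only two skip-grams: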

(s01 <- tokens_skipgrams(toks, n = 3, skip = 0:1, concatenator = " "))
# tokens from 1 document.
# text1 :
# [1] "insurgents killed in"      "insurgents killed ongoing" "insurgents in ongoing"    
# [4] "insurgents in fighting"    "killed in ongoing"         "killed in fighting"       
# [7] "killed ongoing fighting"   "in ongoing fighting"      

(s02 <- tokens_skipgrams(toks, n = 3, skip = 0:2, concatenator = " "))
# tokens from 1 document.
# text1 :
# [1] "insurgents killed in"        "insurgents killed ongoing"   "insurgents killed fighting" 
# [4] "insurgents in ongoing"       "insurgents in fighting"      "insurgents ongoing fighting"
# [7] "killed in ongoing"           "killed in fighting"          "killed ongoing fighting"    
# [10] "in ongoing fighting"        

setdiff(as.character(s02), as.character(s01))
# [1] "insurgents killed fighting"  "insurgents ongoing fighting"
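The same logic can be checked without quanteda. The following is a rough base R sketch, not the package's implementation: skipgrams_sketch is a hypothetical helper that assumes each gap is chosen independently from skip and that n is at least 2. It returns the skip-grams in a different order from tokens_skipgrams, so the results are compared as sets.

```r
# Hypothetical helper, not part of quanteda: build n-grams (n >= 2) by
# picking a gap from `skip` independently for each pair of consecutive words.
skipgrams_sketch <- function(words, n = 3, skip = 0) {
  gaps <- as.matrix(expand.grid(rep(list(skip), n - 1)))
  out <- character(0)
  for (start in seq_along(words)) {
    for (g in seq_len(nrow(gaps))) {
      # Positions of the n words: each step advances by (gap + 1).
      idx <- start + c(0, cumsum(gaps[g, ] + 1))
      if (max(idx) <= length(words)) {
        out <- c(out, paste(words[idx], collapse = " "))
      }
    }
  }
  unique(out)
}

words <- c("insurgents", "killed", "in", "ongoing", "fighting")

# One call with the vector 0:2 versus the union of three scalar calls.
vector_result <- skipgrams_sketch(words, n = 3, skip = 0:2)
scalar_union <- unlist(lapply(0:2, function(k) {
  skipgrams_sketch(words, n = 3, skip = k)
}))

# Skip-grams that need two different gap sizes, so no scalar call finds them.
missing_from_union <- setdiff(vector_result, scalar_union)
```

Under these assumptions, vector_result holds the same ten skip-grams as s02, scalar_union holds the four the question expected, and missing_from_union holds the six skip-grams the question asked about.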