Logical combinations in quanteda dictionaries

Date: 2018-04-17 08:05:21

Tags: r quanteda

I am using quanteda dictionary lookups. I would like to construct dictionary entries that match logical combinations of words.

For example:

Teddybear = (fluffy AND adorable AND soft)

Is this possible? So far I have only found a way to test for phrases such as (Teddybear = (soft fluffy adorable)), but that requires an exact phrase match in the text. How can I get matches that ignore word order?

2 Answers:

Answer 0 (score: 2)

This is not currently possible directly in quanteda (v1.2.0). However, there is a workaround: build dictionary values that are the permutations of the desired sequence. Here is one such solution.

First, I will create some example texts. Note that in these texts the words are separated by "," or, in one case, by "and". In addition, the third text contains only two of the three words, not all three. (More on that in a moment.)

txt <- c("The toy was fluffy, adorable and soft, he said.",
         "The soft, adorable, fluffy toy was on the floor.",
         "The fluffy, adorable toy was shaped like a bear.")

Now, let's write a pair of functions to generate the permuted sequences and subsequences from a vector. These use functions from the combinat package. The first is an internal function that generates the permutations; the second is the main calling function, which can generate either full-length permutations or permutations of every subsample down to subsample_limit. (For more general use these would need error checking, but I have skipped that for this example.)

# internal helper: collapse every permutation of vec into a phrase
genperms <- function(vec) {
    combs <- combinat::permn(vec)
    sapply(combs, paste, collapse = " ")
}

# vec: any vector
# subsample_limit: integer from 1 to length(vec); permutations are
#   generated for every subsequence of vec down to this length
#   (default is no subsampling, i.e. full-length permutations only)
permutefn <- function(vec, subsample_limit = length(vec)) {
    ret <- character()
    for (i in length(vec):subsample_limit) {
        ret <- c(ret, 
                 unlist(lapply(combinat::combn(vec, i, simplify = FALSE), 
                               genperms)))
    }
    ret
}

A demonstration of how these work:

fas <- c("fluffy", "adorable", "soft")
permutefn(fas)
# [1] "fluffy adorable soft" "fluffy soft adorable" "soft fluffy adorable"
# [4] "soft adorable fluffy" "adorable soft fluffy" "adorable fluffy soft"

# and with subsampling:
permutefn(fas, 2)
#  [1] "fluffy adorable soft" "fluffy soft adorable" "soft fluffy adorable"
#  [4] "soft adorable fluffy" "adorable soft fluffy" "adorable fluffy soft"
#  [7] "fluffy adorable"      "adorable fluffy"      "fluffy soft"         
# [10] "soft fluffy"          "adorable soft"        "soft adorable" 
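As an aside for readers outside R, the same enumeration (permutations of every subsequence of the vector, from full length down to a given minimum) can be sketched with Python's itertools; the function name mirrors the R one and is otherwise hypothetical:

```python
from itertools import combinations, permutations

def permutefn(vec, subsample_limit=None):
    """Phrases for every permutation of every subsequence of vec whose
    length is between subsample_limit and len(vec)."""
    if subsample_limit is None:
        subsample_limit = len(vec)  # default: full-length permutations only
    out = []
    for size in range(len(vec), subsample_limit - 1, -1):
        for combo in combinations(vec, size):
            for perm in permutations(combo):
                out.append(" ".join(perm))
    return out

fas = ["fluffy", "adorable", "soft"]
print(len(permutefn(fas)))      # 6  (3! full-length orderings)
print(len(permutefn(fas, 2)))   # 12 (6 triples + 3 pairs x 2 orderings)
```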

Now apply these to the texts using tokens_lookup(). I avoid punctuation complications by setting remove_punct = TRUE. To show the original tokens that were not replaced, I also use exclusive = FALSE.

tokens(txt, remove_punct = TRUE) %>%
    tokens_lookup(dictionary = dictionary(list(teddybear = permutefn(fas))),
                  exclusive = FALSE)
# tokens from 3 documents.
# text1 :
# [1] "The"      "toy"      "was"      "fluffy"   "adorable" "and"      "soft"    
# [8] "he"       "said"    
# 
# text2 :
# [1] "The"      "TEDDYBEAR" "toy"       "was"       "on"        "the"      
# [7] "floor"    
# 
# text3 :
# [1] "The"      "fluffy"   "adorable" "toy"      "was"      "shaped"   "like"    
# [8] "a"        "bear"   

The first text is not caught here, because its second and third elements are separated by "and". We can remove that token with tokens_remove() and then get the match:

tokens(txt, remove_punct = TRUE) %>%
    tokens_remove("and") %>%
    tokens_lookup(dictionary = dictionary(list(teddybear = permutefn(fas))),
                  exclusive = FALSE)
# tokens from 3 documents.
# text1 :
# [1] "The"       "toy"       "was"       "TEDDYBEAR" "he"        "said"     
# 
# text2 :
# [1] "The"       "TEDDYBEAR" "toy"       "was"       "on"        "the"       "floor"    
# 
# text3 :
# [1] "The"      "fluffy"   "adorable" "toy"      "was"      "shaped"   "like"    
# [8] "a"        "bear"  

Finally, to match the third text, in which only two of the three dictionary elements are present, we can pass 2 as the subsample_limit argument:

tokens(txt, remove_punct = TRUE) %>%
    tokens_remove("and") %>%
    tokens_lookup(dictionary = dictionary(list(teddybear = permutefn(fas, 2))), 
                  exclusive = FALSE)
# tokens from 3 documents.
# text1 :
# [1] "The"       "toy"       "was"       "TEDDYBEAR" "he"        "said"     
# 
# text2 :
# [1] "The"       "TEDDYBEAR" "toy"       "was"       "on"        "the"       "floor"    
# 
# text3 :
# [1] "The"       "TEDDYBEAR" "toy"       "was"       "shaped"    "like"      "a"        
# [8] "bear" 

Answer 1 (score: 2)

If you want to know which documents contain all of the words, you can do the following:

require(quanteda)

txt <- c("The toy was fluffy, adorable and soft, he said.",
         "The soft, adorable, fluffy toy was on the floor.",
         "The fluffy, adorable toy was shaped like a bear.")
dict <- dictionary(list(teddybear = list(c1 = "fluffy", c2 = "adorable", c3 = "soft")))

mt <- dfm_lookup(dfm(txt), dictionary = dict["teddybear"], levels = 2)

cbind(mt, "teddybear" = as.numeric(rowSums(mt > 0) == length(dict[["teddybear"]])))

# Document-feature matrix of: 3 documents, 4 features (16.7% sparse).
# 3 x 4 sparse Matrix of class "dfm"
#        features
# docs    c1 c2 c3 teddybear
#   text1  1  1  1         1
#   text2  1  1  1         1
#   text3  1  1  0         0
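The AND logic itself is independent of quanteda: a document matches when every dictionary word appears somewhere among its tokens. As a language-neutral illustration, here is a minimal Python sketch (the matches_all helper is hypothetical, and the tokenisation is deliberately crude):

```python
def matches_all(text, words):
    # crude tokenisation: split on whitespace, strip punctuation, lowercase
    tokens = {t.strip(".,").lower() for t in text.split()}
    # AND logic: every dictionary word must be present in the token set
    return all(w in tokens for w in words)

txt = ["The toy was fluffy, adorable and soft, he said.",
       "The soft, adorable, fluffy toy was on the floor.",
       "The fluffy, adorable toy was shaped like a bear."]
print([matches_all(t, ["fluffy", "adorable", "soft"]) for t in txt])
# [True, True, False]
```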