Question

Building on this question: Interpretation of dfm_weight(scheme='prop') with groups (quanteda)

If I have the following function:

plot_topterms = function(data,text_field,n,...){

  corp=corpus(data,text_field = text_field) %>% 
    dfm(remove_numbers=T,remove_punct=T,remove=c(stopwords('english')),ngrams=1:2) %>%
    dfm_weight(scheme ='prop') %>% 
    dfm_group(groups=...) %>% 
    dfm_replace(pattern=as.character(lemma$first),replacement = as.character(lemma$X1)) %>% 
    dfm_remove(pattern = c(paste0("^", stopwords("english"), "_"), paste0("_", stopwords("english"), "$")), valuetype = "regex") %>% 
    dfm_remove(toRemove)
  freq_weight <- textstat_frequency(corp, n = n)

  ggplot(data = freq_weight, aes(x = nrow(freq_weight):1, y = frequency)) +
    geom_bar(stat='identity')+
    facet_wrap(~ group, scales = "free") +
    coord_flip() +
    scale_x_continuous(breaks = nrow(freq_weight):1,
                       labels = freq_weight$feature) +
    #scale_y_continuous(labels = scales::percent)+
    theme(text = element_text(size=20))+
    labs(x = NULL, y = "Relative frequency")
}

and I do not pass a grouping variable, so I call it like this:

plot_topterms(df,textField,n=10)

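The output I get has a group variable equal to all. Is this equivalent to not having the dfm_group line at all? If so, and the relative frequency for the word fun is 60, does that mean that 60% of all documents contain that word?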

Answer 1

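Your interpretation of the "all" group is correct. If groups is not specified in textstat_frequency(), the group defaults to "all". In your function, even though the dfm has been grouped through the dfm_group() call, the groups argument is never passed to textstat_frequency(), so the group will always be "all" in plot_topterms().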

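A value of 60 for a feature in this plot means that the sum of that feature's relative term frequencies (computed within each document) is 60. If you look at the question you reference above, you will see how this works for a simple example. The relative frequency of a is 0.20 in text1 and 0.67 in text2, so textstat_frequency() sums the two to get 0.87. Your 60 is the analogue of that 0.87.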
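The arithmetic can be reproduced with a small sketch. The two texts below are hypothetical, chosen only so that a has relative frequencies of 0.20 and roughly 0.67; they are not taken from the referenced question. The sketch sums the columns of the weighted dfm directly with colSums() rather than calling textstat_frequency(), on the assumption that this is the quantity being plotted.

```r
library(quanteda)

# Hypothetical texts: "a" is 1 of 5 tokens in text1 and 2 of 3 tokens in text2
txt <- c(text1 = "a b c d e", text2 = "a a b")

# Build a dfm and convert counts to within-document proportions
dfmat_prop <- dfm_weight(dfm(tokens(txt)), scheme = "prop")

# Summing each feature's proportions across documents
summed <- colSums(dfmat_prop)
summed[["a"]]  # 1/5 + 2/3, about 0.87
```

With many documents this sum can easily exceed 1, which is why a value such as 60 is possible.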

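This is not the same as document frequency, which is the number of documents in which a feature occurs at least once. If you want the document frequency of a feature (which is what your interpretation describes), you should plot docfreq from the textstat_frequency() return rather than frequency.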
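To make the difference concrete, here is a minimal sketch on three hypothetical documents. It uses docfreq() from quanteda on the unweighted dfm; the texts and variable names are assumptions made up for illustration.

```r
library(quanteda)

# Hypothetical documents: "fun" occurs in two of the three
txt <- c(d1 = "fun fun fun", d2 = "work work fun", d3 = "work rest")
dfmat <- dfm(tokens(txt))

# Document frequency: number of documents containing the feature at least once
n_docs_with_fun <- docfreq(dfmat)[["fun"]]

# Share of documents containing "fun"
share_docs <- n_docs_with_fun / ndoc(dfmat)

# Sum of within-document relative frequencies: 3/3 + 1/3 + 0
sum_rel_freq <- colSums(dfm_weight(dfmat, scheme = "prop"))[["fun"]]
```

Here share_docs is 2/3 while sum_rel_freq is about 1.33, so the two quantities answer different questions.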

I would note, however, that plot_topterms() is not a well-designed function.

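It relies on several variables that are not local to the function, namely toRemove and lemma.
It will not correctly pass ... through to the dfm_group() call. You should explicitly specify a groups argument in the function signature.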
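The design point about an explicit groups argument can be illustrated without quanteda. The sketch below is base R only and works on a plain matrix of relative frequencies; sum_by_group() is a hypothetical helper, not part of any package, and the fallback to a single "all" group merely mimics the behaviour described above.

```r
# Hypothetical helper: sum rows of a weight matrix by an explicit grouping vector
sum_by_group <- function(weights, groups = NULL) {
  # With no grouping supplied, treat every document as one group named "all"
  if (is.null(groups)) groups <- rep("all", nrow(weights))
  rowsum(weights, group = groups)
}

# Made-up relative frequencies for three documents and two features
m <- rbind(text1 = c(a = 0.20, b = 0.80),
           text2 = c(a = 0.67, b = 0.33),
           text3 = c(a = 0.50, b = 0.50))

ungrouped <- sum_by_group(m)                            # one row: "all"
grouped   <- sum_by_group(m, groups = c("x", "x", "y")) # rows: "x", "y"
```

Because groups appears in the signature, the caller can see that grouping is optional and what happens when it is omitted. Objects such as toRemove and lemma could be passed in as arguments in the same way.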

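If we were designing a new function for the package, we would create textplot_frequency(), which would plot the return value of textstat_frequency(). That is essentially what your ggplot() call does once the user has built a textstat_frequency object. Such a function could make smarter use of the group variable built into every textstat_frequency object, so that objects whose only group is "all" would be plotted as a single facet.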
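A rough sketch of that idea, under stated assumptions: plot_frequency_sketch() is a hypothetical name, not an existing quanteda function, and it assumes only that its input is a data frame with feature, frequency and group columns, which are the columns the question's code already uses. It is not how the package would necessarily implement textplot_frequency().

```r
library(ggplot2)

# Hypothetical sketch: plot a frequency table, faceting only when there is
# more than one group
plot_frequency_sketch <- function(freq) {
  p <- ggplot(freq, aes(x = reorder(feature, frequency), y = frequency)) +
    geom_col() +
    coord_flip() +
    labs(x = NULL, y = "Summed relative frequency")
  # A lone group (for example "all") is drawn as a single panel
  if (length(unique(freq$group)) > 1) {
    p <- p + facet_wrap(~ group, scales = "free")
  }
  p
}

# Made-up data in the assumed shape
freq_all <- data.frame(feature = c("fun", "work"),
                       frequency = c(1.33, 1.17),
                       group = "all")
p_all <- plot_frequency_sketch(freq_all)
```

Note that reorder() sorts features by their overall frequency, so the ordering within each facet is only approximate when several groups share features.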
