Question

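I would like to keep 2- and 3-word phrases (i.e. features) in my dfm whose PMI value is greater than 3 times the number of words in the phrase*.

PMI is defined here as: pmi(phrase) = log(p(phrase) / Product(p(word)))

with p(phrase): the probability of the phrase, based on its relative frequency; Product(p(word)): the product of the probabilities of each word in the phrase.

So far I have used the following code, but the PMI values do not seem to be correct and I cannot find the problem: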
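Written out for a phrase of n words w_1 ... w_n, the definition and the selection rule would read roughly as follows. The symbols w_i and n are introduced here only for illustration, and the natural logarithm is an assumption, since the base is not stated:

```latex
\mathrm{pmi}(w_1 \dots w_n) = \log \frac{p(w_1 \dots w_n)}{\prod_{i=1}^{n} p(w_i)},
\qquad
\text{keep the phrase if } \mathrm{pmi}(w_1 \dots w_n) > 3n .
```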

library(quanteda)

# creating dummy data
id <- c(1:5)
text <- c("positiveemoticon my name is positiveemoticon positiveemoticon i love you", "hello dont", "i love you", "i love you", "happy birthday")
ids_text_clean_test <- data.frame(id, text)
ids_text_clean_test$id <- as.character(ids_text_clean_test$id)
ids_text_clean_test$text <- as.character(ids_text_clean_test$text)

test_corpus <- corpus(ids_text_clean_test[["text"]], docnames = ids_text_clean_test[["id"]])

tokens_all_test <- tokens(test_corpus, remove_punct = TRUE)

## Create a document-feature matrix(dfm)
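# build the n-grams with tokens_ngrams(); recent quanteda versions no longer accept ngrams = in dfm()
doc_phrases_matrix_test <- dfm(tokens_ngrams(tokens_all_test, n = 2:3)) #extracting two- and three word phrases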
doc_phrases_matrix_test

# calculating the pointwise mutual information for each phrase to identify phrases that occur at rates much higher than chance
tcmrs = Matrix::rowSums(doc_phrases_matrix_test) #number of words per user
tcmcs = Matrix::colSums(doc_phrases_matrix_test) #counts of each phrase
N = sum(tcmrs) #number of total words used 
colp = tcmcs/N #proportion of the phrases by total phrases
rowp = tcmrs/N #proportion of each users' words used by total words used
pp = doc_phrases_matrix_test@p + 1
ip = doc_phrases_matrix_test@i + 1
tmpx = rep(0,length(doc_phrases_matrix_test@x)) # new values go here, just a numeric vector
# iterate through sparse matrix:
for (i in 1:(length(doc_phrases_matrix_test@p) - 1) ) {
  ind = pp[i]:(pp[i + 1] - 1)
  not0 = ip[ind]
  icol = doc_phrases_matrix_test@x[ind]
  tmp = log( (icol/N) / (rowp[not0] * colp[i] )) # PMI
  tmpx[ind] = tmp
}

doc_phrases_matrix_test@x = tmpx
doc_phrases_matrix_test

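I believe the PMI of a phrase should not vary by user, but I thought it would be easier to apply the PMI directly to the dfm, so that it is easier to subset the dfm by the features' PMI.

An alternative approach I tried is to compute the PMI for the features directly: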

test_pmi <- textstat_keyness(doc_phrases_matrix_test,  measure =  "pmi",
                             sort = TRUE)
test_pmi

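However, firstly, I get a warning here that NaNs were produced, and secondly, I do not understand the PMI values (e.g. why are there negative values?).

Does anybody have a better idea of how to extract features based on their PMI values as defined above?

Any hint is highly appreciated :)

*following Park et al. (2015)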

Answer 1

You can use the following R code, which uses the udpipe R package to get what you asked for. The example works on a tokenised data.frame that ships with udpipe:

library(udpipe) 
data(brussels_reviews_anno, package = "udpipe") 
x <- subset(brussels_reviews_anno, language %in% "fr") 

## find keywords with PMI > 3 
keyw <- keywords_collocation(x, term = "lemma", 
                             group = c("doc_id", "sentence_id"), ngram_max = 3, n_min = 10) 
keyw <- subset(keyw, pmi > 3) 

## recodes to keywords 
x$term <- txt_recode_ngram(x$lemma, compound = keyw$keyword, ngram = keyw$ngram) 
## create DTM 
dtm <- document_term_frequencies(x = x$term, document = x$doc_id) 
dtm <- document_term_matrix(dtm)

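If you want to get a dataset with a structure similar to x, just use udpipe(text, "english") or any language of your choice. If you want to use quanteda for tokenisation, you can still get the result into a richer data.frame; examples of that are given here and here. Look at the help of the udpipe R package, which has many vignettes (?udpipe).

Note that while PMI is useful, it is far more useful to use the dependency parsing output of the udpipe R package. If you look at the dep_rel field, you will find categories that identify multi-word expressions (e.g. dep_rel fixed/flat/compound are multi-word expressions as defined at http://universaldependencies.org/u/dep/index.html). You can also use these to put multi-word expressions into your document/term/matrix.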
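For comparison, the PMI definition from the question can also be sketched directly in base R on the question's dummy texts. This is only an illustration under stated assumptions: all function and variable names are made up for this sketch, p(phrase) is taken as the phrase's relative frequency among n-grams of the same length, p(word) as the word's relative frequency among all tokens, and tokens are assumed not to contain the "_" separator.

```r
# Sketch only: base R, no packages. All names below are hypothetical.
texts <- c("positiveemoticon my name is positiveemoticon positiveemoticon i love you",
           "hello dont", "i love you", "i love you", "happy birthday")
toks <- strsplit(texts, " ", fixed = TRUE)

# Named count vector for a character vector of items
count_items <- function(x) {
  tab <- table(x)
  setNames(as.numeric(tab), names(tab))
}

# All n-grams of length n in one tokenised text, joined with "_"
make_ngrams <- function(tok, n) {
  if (length(tok) < n) return(character(0))
  starts <- seq_len(length(tok) - n + 1)
  vapply(starts, function(i) paste(tok[i:(i + n - 1)], collapse = "_"), character(1))
}

word_counts  <- count_items(unlist(toks))
ngram_counts <- count_items(unlist(lapply(toks, function(tok) {
  c(make_ngrams(tok, 2), make_ngrams(tok, 3))
})))

# One PMI value per phrase (not per document), plus the "PMI > multiplier * n" flag
phrase_pmi <- function(ngram_counts, word_counts, multiplier = 3, sep = "_") {
  words   <- strsplit(names(ngram_counts), sep, fixed = TRUE)
  n_words <- lengths(words)
  # assumption: p(phrase) is relative to all n-grams of the same length
  totals   <- tapply(ngram_counts, n_words, sum)
  p_phrase <- ngram_counts / as.numeric(totals[as.character(n_words)])
  p_word   <- word_counts / sum(word_counts)
  # log of the product of the word probabilities, computed as a sum of logs
  log_p_words <- vapply(words, function(w) sum(log(p_word[w])), numeric(1))
  pmi <- unname(log(p_phrase) - log_p_words)
  data.frame(phrase = names(ngram_counts), n_words = n_words, pmi = pmi,
             keep = pmi > multiplier * n_words,
             row.names = NULL, stringsAsFactors = FALSE)
}

pmi_table <- phrase_pmi(ngram_counts, word_counts)
kept <- pmi_table$phrase[pmi_table$keep]
```

On a corpus this small, no phrase passes the 3-times-n threshold, so kept is empty; the multiplier argument is there for experimenting. With quanteda, the two count vectors could instead come from colSums() of a unigram dfm and of an n-gram dfm, and the retained phrases could be passed to dfm_select(..., pattern = kept, valuetype = "fixed"). PMI computed this way is negative whenever a phrase occurs less often than the product of its word probabilities would predict.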

Language based processing in R: Selecting features in dfm with certain pointwise mutual information (PMI) value