Splitting ngrams in a document-feature matrix (quanteda)

Time: 2017-05-24 12:48:48

Tags: r quanteda

I was wondering whether it is possible to split ngram features in a document-feature matrix (dfm) so that, for example, a bigram results in two separate unigrams?

head(dfm, n = 3, nfeature = 4)

docs       in_the great plenary emission_reduction
  10752099      3     1       1                  3
  10165509      8     0       0                  3
  10479890      4     0       0                  1

So the dfm above would turn into something like this:

head(dfm, n = 3, nfeature = 4)

docs       in great plenary emission the reduction
  10752099  3     1       1        3   3         3
  10165509  8     0       0        3   8         3
  10479890  4     0       0        1   4         1

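For a better understanding: I got the ngrams in the dfm by translating the features from German to English. Compounds ("Emissionsminderung") are quite common in German but not in English ("emission reduction").

Thanks in advance!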

Edit: the following can be used as a reproducible example.

library(quanteda)

eg.txt <- c('increase in_the great plenary', 
            'great plenary emission_reduction', 
            'increase in_the emission_reduction emission_increase')
eg.corp <- corpus(eg.txt)
eg.dfm <- dfm(tokens(eg.corp))  # tokenize explicitly; newer quanteda versions expect a tokens object here

head(eg.dfm)

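1 Answer:

Answer 0 (score: 0)

I don't know whether this is the best approach (it may use a lot of RAM, because it turns the sparse dfm into a dense data.frame/matrix), but it should work: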

# turn the dfm into a dense matrix and transpose it (features become rows)
MX <- t(as.matrix(eg.dfm))
# split the current feature names by '_'
colsSplit <- strsplit(featnames(eg.dfm), '_', fixed = TRUE)
# replicate each row once per part and give the copies the new split names
MX <- MX[rep(seq_along(colsSplit), lengths(colsSplit)), , drop = FALSE]
rownames(MX) <- unlist(colsSplit)
# sum the rows that share a name and transpose again (documents become rows)
MX2 <- t(rowsum(MX, rownames(MX)))
# turn the matrix back into a dfm
eg.dfm.res <- as.dfm(MX2)

Result:

> eg.dfm.res
Document-feature matrix of: 3 documents, 7 features (33.3% sparse).
3 x 7 sparse Matrix of class "dfmSparse"
       features
docs    emission great in increase plenary reduction the
  text1        0     1  1        1       1         0   1
  text2        1     1  0        0       1         1   0
  text3        2     0  1        2       0         1   1
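
Since the concern above is the dense conversion, here is a minimal sketch of a variant that stays sparse. It is not the method used above: it assumes that indexing a dfm with repeated column indices is allowed and that `dfm_compress()` sums columns sharing a name, and I have not tested it across quanteda versions. The names `parts`, `idx`, `eg.dfm.wide` and `eg.dfm.split` are made up for illustration.

```r
library(quanteda)

eg.txt <- c('increase in_the great plenary',
            'great plenary emission_reduction',
            'increase in_the emission_reduction emission_increase')
eg.dfm <- dfm(tokens(eg.txt))

# split every feature name on '_' (a unigram yields a single part)
parts <- strsplit(featnames(eg.dfm), "_", fixed = TRUE)

# repeat each column once per part, then rename the copies to the parts
# (assumption: a dfm can be indexed with repeated column indices)
idx <- rep(seq_along(parts), lengths(parts))
eg.dfm.wide <- eg.dfm[, idx]
colnames(eg.dfm.wide) <- unlist(parts)

# sum the columns that now share a name; the result is still a sparse dfm
eg.dfm.split <- dfm_compress(eg.dfm.wide, margin = "features")
eg.dfm.split
```

The counts should match the result above, although the feature order may differ. If the tokens object is still available, it may be simpler to split before building the dfm, e.g. with `tokens_split(toks, separator = "_")` followed by `dfm()`.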