Same value but different results? About removeSparseTerms (R)

Date: 2016-10-04 12:29:09

Tags: r

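First of all, here is sample data to reproduce the problem, which I will try to explain below: https://drive.google.com/file/d/0B4RCdYlVF8otUll6V2x0cDJORGc/view?usp=sharing

The problem is that I get different results from removeSparseTerms even though I pass it the same value. It seems to defy human logic, or at least my logic. I have this function: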

generateTDM <- function (Room_name, dest.train, RST){
          s.dir <- sprintf("%s/%s", dest.train, Room_name)
          s.cor <- Corpus(DirSource(directory = s.dir, pattern = "txt", encoding = "UTF-8"))                  # Build a corpus from the already-cleaned txt files.
          s.tdm <- TermDocumentMatrix(s.cor, control = list(bounds = list(local = c(2, Inf)), tokenize = TrigramTokenizer))                     # Build a term-document matrix from the corpus, using unigrams, bigrams and trigrams.
          s.tdm <- removeSparseTerms(s.tdm, RST)                                                           # Keep the terms that appear in more than a fraction (1 - RST) of the files; drop the rest.
      }

Well, when I call this function like this:

tdm.train <- lapply(Room_name, generateTDM, dest.train, RST[p])

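I get different output from the function depending on the other elements of the vector that holds RST. That is, I get different results even though the value passed in is the same.

For example:

Case 1: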

RST <-seq (0.45, 0.6, 0.05)
p<-4

Here RST = (0.45, 0.5, 0.55, 0.6), so RST[p] is 0.6.

The result in this case:

    > tdm.train 
        [[1]]
    <<TermDocumentMatrix (terms: 84, documents: 51)>>
    Non-/sparse entries: 2451/1833
    Sparsity           : 43%
    Maximal term length: 10
    Weighting          : term frequency (tf)

    [[2]]
    <<TermDocumentMatrix (terms: 82, documents: 52)>>
    Non-/sparse entries: 2409/1855
    Sparsity           : 44%
    Maximal term length: 11
    Weighting          : term frequency (tf)

    [[3]]
    <<TermDocumentMatrix (terms: 68, documents: 51)>>
    Non-/sparse entries: 1926/1542
    Sparsity           : 44%
    Maximal term length: 13
    Weighting          : term frequency (tf)

    [[4]]
    <<TermDocumentMatrix (terms: 36, documents: 48)>>
    Non-/sparse entries: 985/743
    Sparsity           : 43%
    Maximal term length: 10
    Weighting          : term frequency (tf)

    [[5]]
    <<TermDocumentMatrix (terms: 48, documents: 50)>>
    Non-/sparse entries: 1295/1105
    Sparsity           : 46%
    Maximal term length: 10
    Weighting          : term frequency (tf)

    [[6]]
    <<TermDocumentMatrix (terms: 27, documents: 50)>>
    Non-/sparse entries: 756/594
    Sparsity           : 44%
    Maximal term length: 8
    Weighting          : term frequency (tf)

Case 2:

    RST <-seq (0.45, 0.8, 0.05)
    p<-4

Now RST = (0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8), ergo RST[p] is the same this time (0.6).

So why do I get different results? I can't understand it.

> tdm.train
[[1]]
<<TermDocumentMatrix (terms: 84, documents: 51)>>
Non-/sparse entries: 2451/1833
Sparsity           : 43%
Maximal term length: 10
Weighting          : term frequency (tf)

[[2]]
<<TermDocumentMatrix (terms: 82, documents: 52)>>
Non-/sparse entries: 2409/1855
Sparsity           : 44%
Maximal term length: 11
Weighting          : term frequency (tf)

[[3]]
<<TermDocumentMatrix (terms: 68, documents: 51)>>
Non-/sparse entries: 1926/1542
Sparsity           : 44%
Maximal term length: 13
Weighting          : term frequency (tf)

[[4]]
<<TermDocumentMatrix (terms: 36, documents: 48)>>
Non-/sparse entries: 985/743
Sparsity           : 43%
Maximal term length: 10
Weighting          : term frequency (tf)

[[5]]
<<TermDocumentMatrix (terms: 57, documents: 50)>>
Non-/sparse entries: 1475/1375
Sparsity           : 48%
Maximal term length: 10
Weighting          : term frequency (tf)

[[6]]
<<TermDocumentMatrix (terms: 34, documents: 50)>>
Non-/sparse entries: 896/804
Sparsity           : 47%
Maximal term length: 8
Weighting          : term frequency (tf)

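I don't know... this is weird, right? If the value of RST is the same, why does the result of removeSparseTerms differ between the two cases for the last two directories? Please help me; not knowing the reason is killing me.

Thank you very much, and have a nice day.

Reproducible example based on the OP's update: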

library(tm)
library(RWeka)
download.file("https://docs.google.com/uc?authuser=0&id=0B4RCdYlVF8otUll6V2x0cDJORGc&export=download", tf <- tempfile(fileext = ".zip"), mode = "wb")
unzip(tf, exdir = tempdir())
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
generateTDM <- function (Room_name, dest.train, rst){
  s.dir <- sprintf("%s/%s", dest.train, Room_name)
  s.cor <- Corpus(DirSource(directory = s.dir, pattern = "txt", encoding = "UTF-8"))                  # Build a corpus from the already-cleaned txt files.
  s.tdm <- TermDocumentMatrix(s.cor, control = list(bounds = list(local = c(2, Inf)), tokenize = TrigramTokenizer))                     # Build a term-document matrix from the corpus, using unigrams, bigrams and trigrams.
  t <- table(s.tdm$i) > (s.tdm$ncol * (1 - rst)) # from tm::removeSparseTerms()
  termIndex <- as.numeric(names(t[t]))
  return(s.tdm[termIndex, ])
}
dest.train <- file.path(tempdir(), "stackoverflow", "TrainDocs")
Room_name <- "Venus"
p <- 4
RST1 <- seq(0.45, 0.6, 0.05)
RST2 <- seq(0.45, 0.8, 0.05)
RST2[p]
# [1] 0.6
RST1[p]
# [1] 0.6
identical(RST2[p], RST1[p])
# [1] FALSE # ?!?

lapply(Room_name, generateTDM, dest.train, RST1[p])
# <<TermDocumentMatrix (terms: 48, documents: 50)>>

lapply(Room_name, generateTDM, dest.train, RST2[p])
# <<TermDocumentMatrix (terms: 57, documents: 50)>> # ?!?
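
To see where such a tiny difference could matter, the comparison copied from tm::removeSparseTerms() in generateTDM above can be evaluated on its own. This is only a sketch: it assumes 50 documents, as in the "Venus" output, and `n_docs`, `thr1` and `thr2` are names made up for illustration.

```r
n_docs <- 50                      # number of documents in the "Venus" TDM above (assumed)
rst1 <- seq(0.45, 0.6, 0.05)[4]
rst2 <- seq(0.45, 0.8, 0.05)[4]

# A term is kept when its document count is > n_docs * (1 - rst).
thr1 <- n_docs * (1 - rst1)
thr2 <- n_docs * (1 - rst2)
sprintf("%.17g", c(thr1, thr2))   # print both thresholds at full precision

# A term that occurs in exactly 20 documents sits right on the boundary.
20 > thr1
20 > thr2
```

If the two thresholds land on opposite sides of 20, terms that occur in exactly 20 of the 50 documents are dropped in one call and kept in the other, which would be consistent with the 48 versus 57 terms reported above.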

1 answer:

Answer 0 (score: 0)

This question seems to be related to the popular R FAQ entry "7.31 Why doesn't R think these numbers are equal?":

  

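The only numbers that can be represented exactly in R's numeric type are integers and fractions whose denominator is a power of 2. All other numbers are internally rounded to (typically) 53 binary digits accuracy. As a result, two floating point numbers will not reliably be equal unless they have been computed by the same algorithm, and not always even then.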

Given

(x <- seq(0.45, 0.6, 0.05))
# [1] 0.45 0.50 0.55 0.60
(y <- seq(0.45, 0.8, 0.05))
# [1] 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80

then

x==y
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
x[4]==y[4]
# [1] FALSE
x[4]-y[4]
# [1] -1.110223e-16
x[3]-y[3]
# [1] 0
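
Printing more digits makes the difference visible. The sketch below also checks one possible explanation, based on my reading of `seq.default()` (an assumption, not something the FAQ states): the sequence is built as `from + (0:n) * by` and then clamped to `to`, so only the shorter sequence would have its last element snapped to exactly 0.6. The name `raw` is made up for illustration.

```r
x <- seq(0.45, 0.6, 0.05)
y <- seq(0.45, 0.8, 0.05)

# 17 significant digits are enough to tell any two distinct doubles apart.
sprintf("%.17g", c(x[4], y[4]))

# Assumed construction inside seq.default(): from + k * by, clamped to `to`.
raw <- 0.45 + 3 * 0.05
raw == y[4]             # y[4] is well below its `to` (0.8), so no clamping applies
pmin(raw, 0.6) == x[4]  # x[4] is the last element, so it would be clamped to 0.6
```

If both comparisons return TRUE, the two sequences reach "0.6" by different computations, which is exactly the situation the FAQ entry warns about.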

Since

MASS::as.fractions(x)
# [1]  9/20   1/2 11/20   3/5

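I guess only the two .5 values (1/2, whose denominator is a power of 2) are exactly representable here. Your function may therefore produce different results.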
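
Two common ways to sidestep this, sketched under the assumption that the thresholds are meant to be exact multiples of 0.05: build the sequence from whole numbers and divide once, or compare with a tolerance via `all.equal()` instead of `==`. The variable names are illustrative.

```r
# Whole-number steps are exact; dividing once gives the double nearest to each decimal.
RST1 <- seq(45, 60, 5) / 100
RST2 <- seq(45, 80, 5) / 100
identical(RST1[4], RST2[4])
# [1] TRUE

# For comparisons, all.equal() tolerates differences of this size.
isTRUE(all.equal(seq(0.45, 0.6, 0.05)[4], seq(0.45, 0.8, 0.05)[4]))
# [1] TRUE
```

With the first approach, `RST1[p]` and `RST2[p]` are the same double, so passing either one to generateTDM should give the same threshold.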