Question

Please see the code below, especially the last two lines:

library(dplyr)
library(qdap)
library(tm)

comments <- read.csv(file = 'c:/raj/r/Toxic Comment Classification/train.csv', header = T, stringsAsFactors = F)
comments %>% glimpse()

# convert to df source for VCorpus
comment_df_source <- comments %>% 
  rename(doc_id = id, text = comment_text) %>% 
  tm::DataframeSource()

# create VCorpus
comment_corpus <- comment_df_source %>% tm::VCorpus()

#Results in 
# <<VCorpus>>
#   Metadata:  corpus specific: 0, document level (indexed): 6
#   Content:  documents: 1
comment_corpus[1]

#Results in 
# toxic severe_toxic obscene threat insult identity_hate
# 1     1            0       0      0      0             0
meta(comment_corpus[1])

#Results in FALSE
comment_corpus[1] %>% (function(x) meta(x)$toxic == 0)

#Results in TRUE
comment_corpus[1] %>% (function(x) meta(x)$toxic == 1)

#Results in 
# <<VCorpus>>
#   Metadata:  corpus specific: 0, document level (indexed): 6
#   Content:  documents: 0
tm_filter(comment_corpus[1], FUN = function(x) meta(x)$toxic == 1)
tm_filter(comment_corpus[1], FUN = function(x) meta(x)[['toxic']] == 1)

The last two lines (variations of the same call) keep returning the wrong output. I'm not sure what I did wrong. I read the docs carefully. Please help.

Raj

Answer 1

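I expanded your example a bit so that the corpus contains 2 documents. There is no need to use tm_filter; tm_filter is better suited to searching within documents.

You can filter the corpus directly by using the meta function.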

library(tm)
library(dplyr)

df <- data.frame(doc_id = c(1, 2), text = c('abc', 'def'), toxic = c(1,0), insult = c(0, 1))
corp <- df %>% DataframeSource() %>% VCorpus()


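meta(corp)
#Results in
#   toxic insult
# 1     1      0
# 2     0      1

# keep only the documents whose indexed metadata has toxic == 1
toxic_corp <- corp[meta(corp, "toxic") == 1]
toxic_corp
#Results in
# <<VCorpus>>
#   Metadata:  corpus specific: 0, document level (indexed): 2
#   Content:  documents: 1

meta(toxic_corp)
#Results in
#   toxic insult
# 1     1      0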
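
To illustrate why the tm_filter calls in the question come back empty, here is a minimal sketch, assuming a recent tm release in which tm_filter hands each individual document (not the one-document corpus) to FUN. Under that assumption, meta() inside FUN sees only the document's own tags (author, id, and so on), so the toxic column, which is stored as indexed metadata on the corpus, is not found there. The names df, corp and abc_corp are made up for this illustration.

```r
library(tm)

# Hypothetical two-document corpus, mirroring the example above
df <- data.frame(doc_id = c("1", "2"),
                 text = c("abc", "def"),
                 toxic = c(1, 0),
                 stringsAsFactors = FALSE)
corp <- VCorpus(DataframeSource(df))

# Indexed metadata on the corpus carries the extra data frame columns
names(meta(corp))

# Metadata of a single document carries only that document's own tags,
# so looking up `toxic` there does not find the column
is.null(meta(corp[[1]])$toxic)

# tm_filter fits predicates on the document itself, e.g. on its content
abc_corp <- tm_filter(corp, FUN = function(doc) {
  any(grepl("abc", content(doc), fixed = TRUE))
})
length(abc_corp)
```

If that reading is right, the predicate in the question never evaluates to TRUE for any document, which would explain the empty result. Subsetting with corp[meta(corp, "toxic") == 1] avoids the problem because it reads the indexed metadata from the corpus.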

tm_filter from tm package in R giving incorrect results
