See the code below, especially the last two lines:
library(dplyr)
library(qdap)
library(tm)
comments <- read.csv(file = 'c:/raj/r/Toxic Comment Classification/train.csv', header = TRUE, stringsAsFactors = FALSE)
comments %>% glimpse()
# convert to df source for VCorpus
comment_df_source <- comments %>%
  rename(doc_id = id, text = comment_text) %>%
  tm::DataframeSource()
# create VCorpus
comment_corpus <- comment_df_source %>% tm::VCorpus()
#Results in
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 6
# Content: documents: 1
comment_corpus[1]
#Results in
#   toxic severe_toxic obscene threat insult identity_hate
# 1     1            0       0      0      0             0
meta(comment_corpus[1])
#Results in FALSE
comment_corpus[1] %>% (function(x) meta(x)$toxic == 0)
#Results in TRUE
comment_corpus[1] %>% (function(x) meta(x)$toxic == 1)
#Results in
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 6
# Content: documents: 0
tm_filter(comment_corpus[1], FUN = function(x) meta(x)$toxic == 1)
tm_filter(comment_corpus[1], FUN = function(x) meta(x)[['toxic']] == 1)
The last two lines (two variations of the same call) keep returning the wrong output. I'm not sure what I did wrong, even though I read the docs carefully. Please help.
Raj
Answer 0 (score: 0)
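I expanded your example a bit so that the corpus contains 2 documents. There is no need to use tm_filter; tm_filter is better suited to searching inside the documents.
You can filter the corpus directly with the meta function.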
library(tm)
library(dplyr)
df <- data.frame(doc_id = c(1, 2), text = c('abc', 'def'), toxic = c(1,0), insult = c(0, 1))
corp <- df %>% DataframeSource() %>% VCorpus()
meta(corp)
#   toxic insult
# 1     1      0
# 2     0      1
toxic_corp <- corp[meta(corp, "toxic") == 1]
toxic_corp
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 2
# Content: documents: 1
meta(toxic_corp)
#   toxic insult
# 1     1      0
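As for why the tm_filter calls in the question come back empty, a likely explanation is that tm_filter applies FUN to each document separately, and a single document's own metadata does not include the extra data frame columns such as toxic. Those columns are kept in the corpus-level (indexed) metadata instead. The sketch below illustrates that difference on the same toy data. It assumes a tm version in which DataframeSource stores the extra columns as indexed metadata, as the outputs above suggest, and the variable names are only illustrative.

```r
library(tm)

# Toy data: doc_id and text are the columns DataframeSource expects;
# toxic and insult become corpus-level (indexed) metadata.
df <- data.frame(doc_id = c("1", "2"), text = c("abc", "def"),
                 toxic = c(1, 0), insult = c(0, 1),
                 stringsAsFactors = FALSE)
corp <- VCorpus(DataframeSource(df))

# Corpus-level metadata: a data frame with one row per document.
corpus_toxic <- meta(corp)$toxic

# Metadata of a single document: there is no 'toxic' tag at this level,
# so a comparison against it yields a zero-length logical vector.
doc_toxic <- meta(corp[[1]])$toxic
per_doc_test <- doc_toxic == 1

# Logical indexing on the corpus-level column keeps the matching documents.
toxic_corp <- corp[corpus_toxic == 1]
n_toxic <- length(toxic_corp)
```

If that reading is right, per_doc_test is empty for every document, so tm_filter has nothing to select, while indexing the corpus with the corpus-level column keeps the one toxic document.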