Content is lost when using the VCorpus() function

Time: 2017-01-25 02:06:47

Tags: r tm

I am using the VCorpus() function from the tm package in R. Here is the problem I ran into:

example_text = data.frame(num=c(1,2,3),Author1 = c("Text mining is a great time.","Text analysis provides insights","qdap and tm are used in text mining"),Author2=c("R is a great language","R has many uses","DataCamp is cool!"))

This looks like:

  num                             Author1               Author2
1   1        Text mining is a great time. R is a great language
2   2     Text analysis provides insights       R has many uses
3   3 qdap and tm are used in text mining     DataCamp is cool!

Then I ran df_source = DataframeSource(example_text[,2:3]) to take only the last 2 columns.

df_source looks correct. After that, I ran df_corpus = VCorpus(df_source), and df_corpus[[1]] is

<<PlainTextDocument>>
Metadata:  7
Content:  chars: 2

df_corpus[[1]][1] gives me

$content
[1] "3" "3"

But df_corpus[[1]] should return

<<PlainTextDocument>>
Metadata:  7
Content:  chars: 49

and df_corpus[[1]][1] should return

$content
[1] "Text mining is a great time." "R is a great language"

I don't know what went wrong. Any suggestions would be greatly appreciated.

1 answer:

Answer 0: (score: 0)

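The text in example_text that should be character has become factor, because the 'factory-fresh' value of stringsAsFactors is TRUE, which is odd and, from my point of view, annoying.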

example_text <- data.frame(num=c(1,2,3),Author1 = c("Text mining is a great time.","Text analysis provides insights","qdap and tm are used in text mining"),Author2=c("R is a great language","R has many uses","DataCamp is cool!"))
lapply(example_text, class)

# $num
# [1] "numeric"
# 
# $Author1
# [1] "factor"
# 
# $Author2
# [1] "factor"
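To see why factor columns can turn into digits, here is a minimal base R sketch. It does not call tm at all; it only assumes that somewhere along the way each row gets coerced roughly like as.character() on a list of factors, which would explain the "3" "3" symptom. Note that since R 4.0.0 data.frame() defaults to stringsAsFactors = FALSE, so the sketch builds the factor explicitly to reproduce the old behaviour.

```r
# Build the factor explicitly so the sketch behaves the same on any R version.
author1 <- factor(c("Text mining is a great time.",
                    "Text analysis provides insights",
                    "qdap and tm are used in text mining"))

# A factor stores integer codes plus a lookup table of labels (its levels).
codes  <- as.integer(author1)    # positions within levels(author1)
labels <- as.character(author1)  # the original strings

# Coercing a one-row data frame (a list of factors) with as.character()
# does not return the label; the text is lost at this point.
df <- data.frame(Author1 = author1, stringsAsFactors = TRUE)
row_as_text <- as.character(df[1, , drop = FALSE])

print(codes)
print(labels)
print(row_as_text)
```

The exact code values depend on how the levels are sorted in your locale, so the printed numbers may differ from the ones in the question.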

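To make sure the Author1 and Author2 columns are character columns, you can try any one of the following:

  1. Add options(stringsAsFactors = FALSE) at the start of your code
  2. Add stringsAsFactors = FALSE to the data.frame(...) call
  3. Run example_text[, 2:3] <- lapply(example_text[, 2:3], as.character)
  4. Run example_text[, 2:3] <- lapply(example_text[, 2:3], paste)

After that, everything should work.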
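Two of the options above can be sketched in base R as follows. The variable names texts1, texts2, fixed_at_creation and converted are made up for this illustration; the sketch only checks the column classes and does not call tm.

```r
texts1 <- c("Text mining is a great time.",
            "Text analysis provides insights",
            "qdap and tm are used in text mining")
texts2 <- c("R is a great language", "R has many uses", "DataCamp is cool!")

# Option 2: ask data.frame() for character columns when the frame is created.
fixed_at_creation <- data.frame(num = c(1, 2, 3),
                                Author1 = texts1,
                                Author2 = texts2,
                                stringsAsFactors = FALSE)

# Option 3: start from factor columns and convert them afterwards.
converted <- data.frame(num = c(1, 2, 3),
                        Author1 = texts1,
                        Author2 = texts2,
                        stringsAsFactors = TRUE)
converted[, 2:3] <- lapply(converted[, 2:3], as.character)

# Both frames should now report "character" for Author1 and Author2.
print(sapply(fixed_at_creation, class))
print(sapply(converted, class))

# Total characters in the first row's two texts (the question expects 49).
first_row_chars <- sum(nchar(c(converted$Author1[1], converted$Author2[1])))
print(first_row_chars)
```

Option 2 is usually the least intrusive, since it keeps the setting local to one call instead of changing a global option.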