I am using the VCorpus() function from the tm package in R. Here is the problem I'm running into:
example_text = data.frame(num=c(1,2,3),Author1 = c("Text mining is a great time.","Text analysis provides insights","qdap and tm are used in text mining"),Author2=c("R is a great language","R has many uses","DataCamp is cool!"))
This looks like:
  num                             Author1               Author2
1   1        Text mining is a great time. R is a great language
2   2     Text analysis provides insights       R has many uses
3   3 qdap and tm are used in text mining     DataCamp is cool!
Then I run df_source = DataframeSource(example_text[,2:3]) to pull out only the last two columns.
df_source looks correct. After that, I run df_corpus = VCorpus(df_source), and df_corpus[[1]] is
<<PlainTextDocument>>
Metadata: 7
Content: chars: 2
and df_corpus[[1]][1] gives me
$content
[1] "3" "3"
But df_corpus[[1]] should return
<<PlainTextDocument>>
Metadata: 7
Content: chars: 49
and df_corpus[[1]][1] should return
$content
[1] "Text mining is a great time." "R is a great language"
I don't know what went wrong. Any suggestions would be greatly appreciated.
Answer 0: (score: 0)
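The text columns in example_text that should be character have become factors, because the "factory-fresh" value of stringsAsFactors is TRUE (this was the default before R 4.0.0), which is odd and, from my point of view, annoying.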
example_text <- data.frame(num=c(1,2,3),Author1 = c("Text mining is a great time.","Text analysis provides insights","qdap and tm are used in text mining"),Author2=c("R is a great language","R has many uses","DataCamp is cool!"))
lapply(example_text, class)
# $num
# [1] "numeric"
#
# $Author1
# [1] "factor"
#
# $Author2
# [1] "factor"
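As a rough illustration of why factor columns can turn into "3" "3", the sketch below coerces the first row with as.character(). That tm does exactly this internally is an assumption here, not something confirmed above; the sketch only shows that coercing a one-row data frame of factors yields integer level codes rather than text. The names first_row, codes, and texts are made up for the example.

```r
# Build the data frame with factor columns, as older R versions did by default.
example_text <- data.frame(
  num = c(1, 2, 3),
  Author1 = c("Text mining is a great time.",
              "Text analysis provides insights",
              "qdap and tm are used in text mining"),
  Author2 = c("R is a great language",
              "R has many uses",
              "DataCamp is cool!"),
  stringsAsFactors = TRUE  # set explicitly so this behaves the same on R >= 4.0.0
)

# Hypothetical helper variable: the first document's two text columns.
first_row <- example_text[1, 2:3]

# Coercing a one-row data frame (a list of factors) returns the integer
# level codes as strings, not the underlying text.
codes <- as.character(first_row)
print(codes)

# Converting each column on its own recovers the text.
texts <- vapply(first_row, as.character, character(1))
print(unname(texts))
```

The exact code shown for Author1 depends on how the locale sorts the factor levels, so treat the printed digits as illustrative rather than fixed.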
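To make sure the Author1 and Author2 columns are character columns, you can try any of the following:
- Run options(stringsAsFactors = FALSE) before creating the data frame.
- Add stringsAsFactors = FALSE to the data.frame(...) call.
- Convert the columns afterwards: example_text[, 2:3] <- lapply(example_text[, 2:3], as.character)
- Or, equivalently: example_text[, 2:3] <- lapply(example_text[, 2:3], paste)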
After that, everything should work as expected.
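A minimal sketch for checking two of these fixes in base R, without involving tm at all. The variable names author1, author2, fixed_at_creation, and converted are hypothetical and exist only for this example.

```r
author1 <- c("Text mining is a great time.",
             "Text analysis provides insights",
             "qdap and tm are used in text mining")
author2 <- c("R is a great language",
             "R has many uses",
             "DataCamp is cool!")

# Fix 1: create the data frame with character columns from the start.
fixed_at_creation <- data.frame(num = c(1, 2, 3),
                                Author1 = author1,
                                Author2 = author2,
                                stringsAsFactors = FALSE)

# Fix 2: start from factor columns and convert them afterwards.
converted <- data.frame(num = c(1, 2, 3),
                        Author1 = author1,
                        Author2 = author2,
                        stringsAsFactors = TRUE)
converted[, 2:3] <- lapply(converted[, 2:3], as.character)

# Both data frames should now report "character" for the text columns.
print(sapply(fixed_at_creation, class))
print(sapply(converted, class))

# Coercing the first row now yields the text rather than level codes.
print(as.character(converted[1, 2:3]))
```

Checking the column classes like this before calling DataframeSource() is a cheap way to confirm the text really is character data.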