I have imported a CSV file into a data frame in R, and one of its columns contains text.
I want to run some text analysis on that column. How should I go about it?
I tried making a new data frame that contains only the text column.
library(dplyr) # needed for %>% and select()

OnlyTXT <- Txtanalytics1 %>%
  select(problem_note_text)
View(OnlyTXT)
Answer (score: 4)
This will get you started.
install.packages("gtools", dependencies = T)
library(gtools) # if problems calling library, install.packages("gtools", dependencies = T)
library(qdap) # qualitative data analysis package (it masks %>%)
library(tm) # framework for text mining; it loads NLP package
library(Rgraphviz) # depict the terms within the tm package framework
library(SnowballC); library(RWeka); library(rJava); library(RWekajars) # wordStem is masked from SnowballC
library(Rstem) # stemming terms as a link from R to Snowball C stemmer
library(stringr) # needed for str_replace_all() used below
The following assumes your text variable (your OnlyTXT) is in a data frame "df" and that the column is labeled "text".
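If the column is still sitting in OnlyTXT from the question, a minimal bridge to that setup (a sketch, assuming the column name problem_note_text from the question) could be:

df <- data.frame(text = OnlyTXT$problem_note_text, # column from the question's OnlyTXT
                 stringsAsFactors = FALSE)         # keep it as character, not factor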
df$text <- as.character(df$text) # to make sure it is text
# prepare the text by lower casing, removing numbers, extra white space, punctuation and unimportant words. The `tm::` prefix is being cautious.
df$text <- tolower(df$text)
df$text <- tm::removeNumbers(df$text)
df$text <- str_replace_all(df$text, "  ", " ") # collapse double spaces into a single space
df$text <- str_replace_all(df$text, pattern = "[[:punct:]]", " ")
df$text <- tm::removeWords(x = df$text, stopwords(kind = "SMART"))
corpus <- Corpus(VectorSource(df$text)) # turn into corpus
tdm <- TermDocumentMatrix(corpus) # create tdm from the corpus
freq_terms(text.var = df$text, top = 25) # find the 25 most frequent words
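As a rough next step (not part of the original answer, just a sketch using base tm functions), you can inspect the term-document matrix directly:

findFreqTerms(tdm, lowfreq = 10)                 # terms that appear at least 10 times
m <- as.matrix(tdm)                              # dense matrix; fine for a small corpus
term_freq <- sort(rowSums(m), decreasing = TRUE) # total count per term
head(term_freq, 25)                              # similar in spirit to freq_terms() above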
There is much more you can do with the tm or qdap packages.
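For example (a hedged sketch; "error" is just a placeholder, swap in a term that actually occurs in your corpus), tm can surface correlated terms and trim very sparse ones:

findAssocs(tdm, terms = "error", corlimit = 0.25)  # terms correlated with "error"
tdm_small <- removeSparseTerms(tdm, sparse = 0.95) # drop terms absent from more than 95% of documents
inspect(tdm_small)                                 # quick look at the reduced matrix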