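I am trying to calculate the ratio of each column total to the total of all columns.
The document-term matrix I am working with is very large, so almost any test that I get wrong ends up crashing RStudio.
Here is my working code so far: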
randomSample = read.csv("randomSample2016.csv", stringsAsFactors = FALSE)
str(randomSample)
randomSample$tweet <- as.character(randomSample$tweet)
randomSample$tweetlength <- nchar(randomSample$tweet)
hist(randomSample$tweetlength)
library("tm")
## Use the tm library to construct a document-term matrix of term frequencies
randomSample_corpus <- Corpus(VectorSource(randomSample$tweet))
print(randomSample_corpus)
inspect(randomSample_corpus[1:3])
#clean up corpus
#make all letters lowercase
randomSample_corpus_clean <- tm_map(randomSample_corpus, tolower)
#Remove Numbers
randomSample_corpus_clean <- tm_map(randomSample_corpus_clean,
removeNumbers)
#Remove punctuation
randomSample_corpus_clean <- tm_map(randomSample_corpus_clean,
removePunctuation)
#Remove stop words
randomSample_corpus_clean <- tm_map(randomSample_corpus_clean, removeWords,
stopwords())
#remove unneeded whitespace
randomSample_corpus_clean <- tm_map(randomSample_corpus_clean,
stripWhitespace)
#Inspect cleaned corpus
inspect(randomSample_corpus_clean[1:3])
#Create document term matrix
randomSample_dtm <- DocumentTermMatrix(randomSample_corpus_clean)
#convert to R matrix
dtm2 <- as.matrix(randomSample_dtm)
#obtain individual word frequencies
frequency <- colSums(dtm2)
This gives me the total frequency of every word in dtm2, but when I try to add a new row for the column totals (dtm2$newcolumn <- 0), I end up with insane RAM usage.
Answer 0 (score: 0)
That is because your [...] returned a named array. What you want is
colSums
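
A minimal sketch of the ratio calculation asked about in the question, using a tiny toy matrix in place of dtm2 (the name dtm_toy and its values are made up for illustration). Since colSums returns a named numeric vector, dividing it by its own sum should give each term's share of the grand total without adding a row or column to the matrix. As far as I know, using $<- on a plain matrix makes R coerce it to a list, which would likely explain the RAM blow-up.

```r
# Toy stand-in for dtm2: rows are documents, columns are terms (hypothetical data)
dtm_toy <- matrix(
  c(1, 0, 2,
    0, 3, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("apple", "banana", "cherry"))
)

# Column totals: a named numeric vector with one entry per term
frequency <- colSums(dtm_toy)

# Share of each term in the grand total; plain vector arithmetic,
# so the matrix itself is never modified
proportion <- frequency / sum(frequency)

print(proportion)
```

If the as.matrix step is itself the memory bottleneck, slam::col_sums(randomSample_dtm) should give the same totals directly from the sparse matrix, assuming the slam package is available (tm depends on it).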