I have a dataframe containing all of my data, roughly 26k entries. From this I created three subsets: test, training and validation datasets. For each of these sets I generated a corresponding corpus and document term matrix.
An example of the raw data matrix is below. As you can see, the unique ID is column 4.
Below is an example of one of the document term matrices. The row names clearly correspond to the tweet_ID column of the raw data above.
What I want to do is use the unique ID (the row name in the document term matrix) to look up the raw data matrix and extract the label. The label is the "stance" column in the raw data, column 10. I then want to append the labels as an extra column on my document term matrix, either before or after the term columns.
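In essence, the lookup I am after is something like the following untested sketch, which uses match() on the DTM row names (testDTM and rawData are as defined in the script below):

```r
# Untested sketch: look up the stance label for each DTM row by tweet_ID.
dtmFrame <- data.frame(as.matrix(testDTM))
dtmFrame$stance <- rawData$stance[match(rownames(dtmFrame), rawData$tweet_ID)]
```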
I have tried using merge and grepl, but I ran into various errors related to the row names. I looked at this question, but the solution using merge did not work.
My code so far works, but I have included it below in case it is of use to anyone.
Main script
#Imports
getImports <- function(){
library(jsonlite)
library(tm)
library(fpc)
library(stringr)
library(SnowballC)
}
#Main Program
run <- function(){
#Includes
source("./Classifier_Functions.R")
#Get the raw data
rawData <- data.frame(getInput())
rawData$tweet <- sapply(rawData$tweet,function(row) iconv(row, "latin1", "ASCII", sub=""))
# Set custom reader
aReader <- readTabular(mapping=list(content="tweet", id="tweet_ID"))
#Set the data specification to 70% training, 20% test and 10% validate
spec = c(train = .7, test = .2, validate = .1)
#Set the seed for reproducible results when using random generators
set.seed(1)
#Sample the raw data
g=sample(cut(
seq(nrow(rawData)),
nrow(rawData)*cumsum(c(0,spec)),
labels = names(spec)
))
#Split the raw data into the three groups specified
groups = split(rawData, g)
#Generate clean corpuses for all data sets
testCorpus <- createCorpus(groups$test,aReader)
trainCorpus <- createCorpus(groups$train,aReader)
validateCorpus <- createCorpus(groups$validate,aReader)
#Produce 'bag of words' Document Term Matrices for test, training and validation data
testDTM <- getDocumentTermMatrix(testCorpus)
trainDTM <- getDocumentTermMatrix(trainCorpus)
validateDTM <- getDocumentTermMatrix(validateCorpus)
#Need to extract labels from raw data and append to document term matrices here.
}
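For clarity, the cumsum/cut/sample idiom used above partitions the rows 70/20/10; a minimal standalone sketch on toy data (not my real dataframe) looks like this:

```r
# Toy illustration of the 70/20/10 split used in run()
spec <- c(train = .7, test = .2, validate = .1)
df <- data.frame(x = 1:100)
set.seed(1)
# cut() assigns each row index to a group by the cumulative break points,
# and sample() shuffles the assignments so the groups are random
g <- sample(cut(seq(nrow(df)), nrow(df) * cumsum(c(0, spec)), labels = names(spec)))
groups <- split(df, g)
sapply(groups, nrow)  # train 70, test 20, validate 10
```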
Function file
#Get raw input data from JSON file
getInput <- function() {
#Import data
json_file <-"path_to_file"
#Set data to dataframe
frame <- fromJSON(json_file)
cat("\nImported data.")
return(frame)
}
#Produces a corpus
createCorpus <- function(data, r){
corpus <- VCorpus(DataframeSource(data), readerControl = list(reader = r))
corpus <- tm_map(corpus, content_transformer(tolower)); cat("\nTransformed to lowercase.")
corpus <- tm_map(corpus, stripWhitespace); cat("\nWhitespace stripped.")
corpus <- tm_map(corpus, removePunctuation); cat("\nPunctuation removed.")
corpus <- tm_map(corpus, removeNumbers); cat("\nNumbers removed.")
corpus <- tm_map(corpus, removeWords, stopwords('english')); cat("\nStopwords removed.")
corpus <- tm_map(corpus, stemDocument); cat("\nStemming complete.")
cat("\nCorpus creation complete.")
return(corpus)
}
#Generate Document Term Matrix
getDocumentTermMatrix<-function(corpus){
dtm <- DocumentTermMatrix(corpus)
dtm<-removeSparseTerms(dtm, 0.9)
cat("\nDTM produced.")
return(dtm)
}
#Generate tf-idf Document Term Matrix
getTfIdfMatrix<-function(corpus){
tfIdf <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
cat("\nTf-Idf computation complete.")
return(tfIdf)
}
Edit: As suggested in the comments, dput output to recreate the raw data, along with a sample DTM, can be found in this Google Drive folder. The dput() output is too large for me to share here as plain text.
Edit 2: Attempts using merge include
merge(rawData[c(10)], data.frame(as.matrix(testDTM)), by="tweet_ID", all.x=TRUE)
and
merge(rawData[c(10)], data.frame(as.matrix(testDTM)), by.x="tweet_ID", by.y="row.names")
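I suspect the second attempt fails because rawData[c(10)] keeps only the stance column, so there is no tweet_ID left to merge on. A variant that keeps both columns would be (untested):

```r
# Untested: keep both the ID and the label so merge() can match on tweet_ID,
# joining against the DTM's row names
labelled <- merge(rawData[, c("tweet_ID", "stance")],
                  data.frame(as.matrix(testDTM)),
                  by.x = "tweet_ID", by.y = "row.names")
```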
Edit 3: Added a plain text dput() of the first 5 rows of the raw data.
Raw data
structure(list(numRetweets = c(1L, 339L, 1L, 179L, 0L), numFavorites = c(2L, 178L, 2L, 152L, 0L), username = c("iainastewart", "DavidNuttallMP", "DavidNuttallMP", "DavidNuttallMP", "DavidNuttallMP"), tweet_ID = c("745870298600316929", "740663385214324737", "741306107059130368", "742477469983363076", "743146889596534785"), tweet_length = c(140L, 118L, 140L, 139L, 63L), tweet = c("RT @carolemills77: Many thanks to all the @mkcouncil #EUref staff who are already in the polling stations ready to open at 7am and the Elec", "RT @BetterOffOut: If you agree with @DanHannanMEP, please RT. #VoteLeave #Brexit #BetterOffOut ", "RT @iaingartside: Out with @DavidNuttallMP @DeneVernon @CllrSueNuttall Campaigning to \"Leave\" in the EU ref in Bury Today #Brexit https://t", "RT @simplysimontfa: Don't be distracted by good opinion polls 4 Leave; the only way to get our country back is to maximise the Brexit vote", "@GrumpyPete Just a little light relief #BetterOffOut #VoteLeave" ), number_hashtags = c(1L, 3L, 1L, 1L, 2L), number_URLs = c(0L, 0L, 0L, 0L, 0L), sentiment_score = c(2L, 2L, -1L, 0L, 0L), stance = c("leave", "leave", "leave", "leave", "leave")), .Names = c("numRetweets", "numFavorites", "username", "tweet_ID", "tweet_length", "tweet", "number_hashtags", "number_URLs", "sentiment_score", "stance" ), row.names = c(NA, 5L), class = "data.frame")
Document term matrix