Merging dataframe columns into a Document Term Matrix

Date: 2017-09-28 13:20:51

Tags: r

I have a dataframe containing all of my data, roughly 26k entries. From it I have created three subsets: test, training, and validation datasets. For each of these sets I have generated a corresponding corpus and document term matrix.

An example of the raw data matrix is shown below. As you can see, the unique ID is column 4.

Raw data table

Below is an example of one of the document term matrices. The row names clearly correspond to the tweet_ID column of the raw data above.

Document term matrix

What I want to do is use the unique IDs (the row names of the document term matrix) to look up the raw data matrix and extract the labels. The label is the "stance" column in the raw data, column 10.

I then want to append these labels to my document term matrix as an additional column, either before or after the term columns.

I have tried using merge and grepl, but I ran into various errors related to the row names. I looked at this question, but the solution using merge did not work.
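In case it helps, one way to avoid merge altogether is to index the stance column with match() on the DTM row names. A minimal sketch on toy stand-in data (the column names tweet_ID and stance come from the dput below; the toy "remain" label and term counts are invented for illustration):

```r
# Toy stand-in for a DTM: a plain matrix whose row names are tweet IDs
dtm <- matrix(c(1L, 0L, 2L, 1L), nrow = 2,
              dimnames = list(c("745870298600316929", "740663385214324737"),
                              c("brexit", "voteleav")))

# Toy raw data, deliberately in a different row order than the DTM
rawData <- data.frame(tweet_ID = c("740663385214324737", "745870298600316929"),
                      stance   = c("leave", "remain"),
                      stringsAsFactors = FALSE)

# match() returns, for each DTM row name, its position in rawData$tweet_ID,
# so the labels come back in DTM row order regardless of rawData's order
labels   <- rawData$stance[match(rownames(dtm), rawData$tweet_ID)]
labelled <- cbind(data.frame(dtm), stance = labels)
```

Because the lookup is driven by the DTM's own row names, the label order stays correct even after the random train/test/validate split reorders the raw data.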

My code so far works fine, but I have included it below in case it is of use to anyone.

Main script

#Imports
getImports <- function(){
    library(jsonlite)
    library(tm)
    library(fpc)
    library(stringr)
    library(SnowballC)
}

#Main Program
run <- function(){
    #Includes
  source("./Classifier_Functions.R")

  #Get the raw data
  rawData <- data.frame(getInput())

  rawData$tweet <- sapply(rawData$tweet,function(row) iconv(row, "latin1", "ASCII", sub=""))

  # Set custom reader
  aReader <- readTabular(mapping=list(content="tweet", id="tweet_ID"))

  #Set the data specification to 70% training, 20% test and 10% validate
  spec = c(train = .7, test = .2, validate = .1)

  #Set the seed for reproducible results when using random generators
  set.seed(1)

  #Sample the raw data
  g=sample(cut(
    seq(nrow(rawData)), 
    nrow(rawData)*cumsum(c(0,spec)), 
    labels = names(spec)
  ))

  #Split the raw data into the three groups specified
  groups = split(rawData, g)

  #Generate clean corpuses for all data sets
  testCorpus <- createCorpus(groups$test,aReader)
  trainCorpus <- createCorpus(groups$train,aReader)
  validateCorpus <- createCorpus(groups$validate,aReader)

  #Produce 'bag of words' Document Term Matrices for test, training and validation data
  testDTM <- getDocumentTermMatrix(testCorpus)
  trainDTM <- getDocumentTermMatrix(trainCorpus)
  validateDTM <- getDocumentTermMatrix(validateCorpus)

  #Need to extract labels from raw data and append to document term matrices here.
}
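The cut()/sample()/split() idiom used in run() above can be checked in isolation. A minimal sketch on a 100-row toy frame (the proportions come out exact here because 100 splits evenly at the break points):

```r
set.seed(1)
toy  <- data.frame(x = seq_len(100))
spec <- c(train = .7, test = .2, validate = .1)

# cut() bins the row positions at breaks 0, 70, 90 and 100, labelling the
# bins train/test/validate; sample() then shuffles those labels so each
# row lands in a random group while the group sizes are preserved
g <- sample(cut(seq(nrow(toy)), nrow(toy) * cumsum(c(0, spec)),
                labels = names(spec)))
groups <- split(toy, g)

sapply(groups, nrow)  # train 70, test 20, validate 10
```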

Functions file

#Get raw input data from JSON file
getInput <- function() {   
   #Import data   
   json_file <-"path_to_file"
   #Set data to dataframe   
   frame <- fromJSON(json_file)   
   cat("\nImported data.")   
   return(frame) 
}

#Produces a corpus 
createCorpus <- function(data, r){   
    corpus <- VCorpus(DataframeSource(data), readerControl = list(reader = r))
    corpus <- tm_map(corpus, content_transformer(tolower))
    cat("\nTransformed to lowercase.")
    corpus <- tm_map(corpus, stripWhitespace)
    cat("\nWhitespace stripped.")
    corpus <- tm_map(corpus, removePunctuation)
    cat("\nPunctuation removed.")
    corpus <- tm_map(corpus, removeNumbers)
    cat("\nNumbers removed.")
    corpus <- tm_map(corpus, removeWords, stopwords('english'))
    cat("\nStopwords removed.")
    corpus <- tm_map(corpus, stemDocument)
    cat("\nStemming complete.")
    cat("\nCorpus creation complete.")
    return(corpus)
}

#Generate Document Term Matrix 
getDocumentTermMatrix <- function(corpus){
    dtm <- DocumentTermMatrix(corpus)
    dtm <- removeSparseTerms(dtm, 0.9)
    cat("\nDTM produced.")
    return(dtm)
}

#Generate tf-idf Document Term Matrix 
getTfIdfMatrix <- function(corpus){
    # weightTfIdf is passed as a function, not called with ()
    tfIdf <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
    cat("\nTf-Idf computation complete.")
    return(tfIdf)
}

Edit: As suggested in the comments, the dput output needed to recreate the raw data, along with a sample DTM, can be found in this Google Drive folder. The dput() output is too large for me to share here as plain text.

Edit 2: Attempts using merge, included below:

merge(rawData[c(10)], data.frame(as.matrix(testDTM)), by="tweet_ID", all.x=TRUE) 

merge(rawData[c(10)], data.frame(as.matrix(testDTM)), by.x="tweet_ID", by.y="row.names")
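The second attempt above looks close; the likely problem (an assumption, since the exact error is not shown) is that rawData[c(10)] keeps only the stance column, so there is no tweet_ID column left to serve as by.x. Selecting both the key and the label first seems to work. A sketch on toy stand-in data (toy term counts and a toy "remain" label, not the real 26k set):

```r
rawData <- data.frame(tweet_ID = c("740663385214324737", "745870298600316929"),
                      stance   = c("leave", "remain"),
                      stringsAsFactors = FALSE)

# Toy stand-in for data.frame(as.matrix(testDTM)): term counts keyed by row name
dtm_df <- data.frame(matrix(c(1L, 0L, 2L, 1L), nrow = 2,
                            dimnames = list(rawData$tweet_ID,
                                            c("brexit", "voteleav"))))

# by.y = "row.names" keys the DTM side on its row names; the left side must
# itself contain the tweet_ID column for by.x to find
merged <- merge(rawData[c("tweet_ID", "stance")], dtm_df,
                by.x = "tweet_ID", by.y = "row.names")
```

With the real data, rawData[c(4, 10)] (tweet_ID plus stance) in place of the name-based selection should behave the same; rows present on only one side are dropped unless all.x or all.y is set.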

Edit 3: Added a plain-text dput() of the first 5 rows of the raw data.

Raw data

structure(list(numRetweets = c(1L, 339L, 1L, 179L, 0L), numFavorites = c(2L,  178L, 2L, 152L, 0L), username = c("iainastewart", "DavidNuttallMP",  "DavidNuttallMP", "DavidNuttallMP", "DavidNuttallMP"), tweet_ID = c("745870298600316929",  "740663385214324737", "741306107059130368", "742477469983363076",  "743146889596534785"), tweet_length = c(140L, 118L, 140L, 139L,  63L), tweet = c("RT @carolemills77: Many thanks to all the @mkcouncil #EUref staff who are already in the polling stations ready to open at 7am and the Elec",  "RT @BetterOffOut: If you agree with @DanHannanMEP, please RT. #VoteLeave #Brexit #BetterOffOut ",  "RT @iaingartside: Out with @DavidNuttallMP @DeneVernon @CllrSueNuttall Campaigning to \"Leave\" in the EU ref in Bury Today #Brexit https://t",  "RT @simplysimontfa: Don't be distracted by good opinion polls 4 Leave; the only way to get our country back is to maximise the Brexit vote",  "@GrumpyPete Just a little light relief #BetterOffOut #VoteLeave" ), number_hashtags = c(1L, 3L, 1L, 1L, 2L), number_URLs = c(0L,  0L, 0L, 0L, 0L), sentiment_score = c(2L, 2L, -1L, 0L, 0L), stance = c("leave",  "leave", "leave", "leave", "leave")), .Names = c("numRetweets",  "numFavorites", "username", "tweet_ID", "tweet_length", "tweet",  "number_hashtags", "number_URLs", "sentiment_score", "stance" ), row.names = c(NA, 5L), class = "data.frame")

Document term matrix

0 Answers