Question

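I am running some analysis with MapReduce. The text file is only 20 MB, and the configured HDFS capacity is 1.95 TB. This setup is for learning purposes only. The MapReduce job is written in R using Rhipe. It reads the file, removes stop words, counts words, and does some regex processing. The replication factor is set to 1, as shown by the default replication factor reported by hadoop fsck /. However, the job exhausts all of my resources. I think memory is fine, but the job stalls after 8-9 hours. When I check each node with df -h, it reports the drive as 100% used on every node. I also tried setting up a datanode on the namenode server (for lack of resources), but that filled up too (I know this is a bad idea). I just don't understand how extracting and counting words from a 20 MB file can take up 2 TB of space. What else can I do? I also set the replication interval to 216,000, but it still uses up all the resources. The HDFS data lives in the /home/data directory. I used du -sh to find the directory where the job files are stored.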

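I am not sure what other information to include, so if you let me know which statistics or details would help, I will add them right away. At the moment there seems to be a lot of information available, but I don't know which parts are relevant and which are not.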
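One statistic that might be worth reporting is how many key/value pairs a map block emits and how large the keys are, since that can be compared directly against the 20 MB input. The sketch below is only an illustration in base R: `emit_stats` is a hypothetical helper, not part of the job, and it assumes keys of the form `word@doc=n` as built by the map expression that follows. It does not by itself explain the disk usage.

```r
# Hypothetical helper (not part of the original job): summarise the keys
# that one map block would pass to rhcollect.
emit_stats <- function(keys) {
  # Splitting on a single space leaves empty "words" wherever two spaces
  # are adjacent, which produces keys that start with "@doc=".
  nonempty <- keys[!startsWith(keys, "@doc=")]
  counts <- table(nonempty)
  list(
    pairs       = length(keys),                      # pairs emitted as written
    empty_words = length(keys) - length(nonempty),   # keys with no word part
    distinct    = length(counts),                    # pairs after local counting
    key_bytes   = sum(nchar(keys, type = "bytes"))   # raw size of all keys
  )
}

# Made-up sample keys for illustration only.
keys <- c("good@doc=1", "@doc=1", "good@doc=1", "bad@doc=1")
stats <- emit_stats(keys)
print(stats)
```

If `distinct` turns out to be much smaller than `pairs`, counting within the map block before emitting would be one way to shrink the intermediate output.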

map.sentiment <- (expression({
  suppressWarnings(suppressPackageStartupMessages(library(rjson)))
  suppressWarnings(suppressPackageStartupMessages(library(tm)))
  suppressWarnings(suppressPackageStartupMessages(library(stringi)))

  #read lines
  l <- unlist(map.values)
  n <- 0
  #create a character vector to hold the emitted keys
  jText <- character()

  #create emoticon dictionary
  emos <- as.character(outer(c(":", ";", ":-", ";-"), c(")", "(", "]", "[", "D", "o", "O", "P", "p", "S", "s"),paste,sep=""))
  emos <- as.character(c(emos, ":'(", ":'-(", "^^", "=(", "=)"))
  #reps <- data.frame(seq_along(emos), emos)
  #reps[,1] <- paste("EMOTIONREPLACE", reps[,1])

  #loop to read lines for each block of map values
  for(i in seq_along(l)){ #seq_along is safe when l is empty

     #parse the i-th line; index l so it matches the loop bounds
     textJSON <- fromJSON(l[i])

     #extract number of line (n)
     n <- textJSON$n

     #extract text values from json
     jsonText.sentiment <- textJSON$text

     cleanTweets <- function(jsonText){
       #some necessary clean up
       #remove retweet entity
       jsonText <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", " ", jsonText)

       #remove @people
       jsonText <- gsub("@\\w+", " ", jsonText)

       #remove html links
       jsonText <- gsub("http\\S+\\s*", " ", jsonText)

       #remove punctuation
       jsonText <- gsub("([^a-zA-Z](')[^a-zA-Z])", " ", jsonText)
       jsonText <- gsub("(#)[^a-zA-Z0-9]"," ", jsonText)
       jsonText <- gsub("[-=]{1,4}(&gt;)|(&lt;)", "", jsonText)
       jsonText <- gsub("(&gt;)|(&lt;)","",jsonText)
       jsonText <- gsub("([-=$+^!?.|~]){1,4}", "", jsonText)

       escape_regex <- function(r){
         stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
       }

       #keep the emoticons
       regex1 <- stri_c("(", stri_c(escape_regex(emos), collapse="|"), ")")
       jsonText <- stri_replace_all_regex(jsonText, stri_c(regex1, "|\\p{P}"), " $1 ")

       #remove digits (the original assigned to a misspelled variable, so this step had no effect)
       jsonText <- gsub('[[:digit:]]+', ' ', jsonText)

       #remove words of one or two letters
       jsonText <- gsub("\\b[a-zA-Z]{1,2}\\b"," ",jsonText)
       jsonText <- removeWords(jsonText, stopwords("en"))
       jsonText <- gsub("^[ \t]+|[ \t]+$", "", jsonText)

       jsonText <- tolower(jsonText)

       jsonText
     }


     jsonText.sentiment.clean <- cleanTweets(jsonText.sentiment) #sentiment

     jText <- c(jText, paste0(unlist(strsplit(jsonText.sentiment.clean, " ")), "@doc=", n))

     }
  lapply(jText, function(i) rhcollect(i, 1))

}))
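
Before submitting to the cluster, it may help to dry-run a map expression locally and count what it emits. The following is a minimal sketch under stated assumptions: the `rhcollect` defined here is a local stand-in that only records pairs (it is not Rhipe's implementation), and `toy.map`, `emitted`, and the sample `map.values` are hypothetical names and data with the same shape as `map.sentiment`, kept to base R so the sketch runs without rjson, tm, or stringi.

```r
# Local stand-in for rhcollect: record each pair instead of writing it out.
emitted <- list()
rhcollect <- function(key, value) {
  emitted[[length(emitted) + 1L]] <<- list(key = key, value = value)
  invisible(NULL)
}

# Toy map expression: split each line on a single space and emit every token,
# mirroring the strsplit(..., " ") step in map.sentiment.
toy.map <- expression({
  for (line in unlist(map.values)) {
    for (w in strsplit(line, " ")[[1]]) rhcollect(w, 1)
  }
})

# Made-up input; note the double space in the first line.
map.values <- list("good  day", "bad day")
eval(toy.map)

n_pairs <- length(emitted)
n_empty <- sum(vapply(emitted, function(p) !nzchar(p$key), logical(1)))
cat("pairs emitted:", n_pairs, " empty keys:", n_empty, "\n")
```

The same harness could in principle evaluate `map.sentiment` itself on a handful of input lines, provided the three packages are installed locally; the pair count per input line then gives a rough sense of how much intermediate data the full file would generate.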

MapReduce job uses all 2 TB of HDFS space for word-count analysis of a small 20 MB file

0 answers: