Fastest way to write to HDFS from R (without any packages)

Asked: 2016-07-07 14:12:54

Tags: r hadoop mapreduce

I am trying to write some data to HDFS using a custom R map-reduce job. The read phase is fast, but the post-processing write phase takes a very long time. I have tried (with functions that can write to a file connection):

output <- file("stdout", "w")
write.table(base,file=output,sep=",",row.names=F)
writeLines(t(as.matrix(base)), con = output, sep = ",", useBytes = FALSE)

But write.table only writes part of the data (the first few rows and the last few rows), and writeLines does not work at all. So now I am trying:

for (row in 1:nrow(base)) {
  cat(base[row, ]$field1, ",", base[row, ]$field2, ",", base[row, ]$field3, ",",
      base[row, ]$field4, ",", base[row, ]$field5, ",", base[row, ]$field6, "\n", sep = '')
}
But this writes very slowly. Here are some logs showing how slow the writes are:


2016-07-07 08:59:30,557 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/406056
2016-07-07 08:59:40,567 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/406422
2016-07-07 08:59:50,582 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/406710
2016-07-07 09:00:00,947 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/407001
2016-07-07 09:00:11,392 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/407316
2016-07-07 09:00:21,832 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/407683
2016-07-07 09:00:31,883 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/408103
2016-07-07 09:00:41,892 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/408536
2016-07-07 09:00:51,895 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/408969
2016-07-07 09:01:01,903 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/409377
2016-07-07 09:01:12,187 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/409782
2016-07-07 09:01:22,198 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/410161
2016-07-07 09:01:32,293 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/410569
2016-07-07 09:01:42,509 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/410989
2016-07-07 09:01:52,515 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/411435
2016-07-07 09:02:02,525 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/411814
2016-07-07 09:02:12,625 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/412196
2016-07-07 09:02:22,988 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/412616
2016-07-07 09:02:32,991 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/413078
2016-07-07 09:02:43,104 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/413508
2016-07-07 09:02:53,115 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/413975
2016-07-07 09:03:03,122 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/414415
2016-07-07 09:03:13,128 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/414835
2016-07-07 09:03:23,131 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/415210
2016-07-07 09:03:33,143 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/415643
2016-07-07 09:03:43,453 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/416031

So I would like to know what I am doing wrong. I am using data.table.
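For comparison, one no-package pattern that avoids per-row output calls entirely is to build every CSV line with a single vectorized paste() and emit them with one writeLines(). A minimal sketch with toy data (the column names and values below are invented for illustration, not taken from the job above):

```r
# Toy stand-in for the real `base` table (invented columns).
base <- data.frame(field1 = 1:3,
                   field2 = c("a", "b", "c"),
                   field3 = c(1.5, 2.5, 3.5),
                   stringsAsFactors = FALSE)

# do.call(paste, ...) pastes all columns element-wise in one call,
# producing a character vector with one CSV line per row.
lines <- do.call(paste, c(base, sep = ","))

# A single writeLines() then makes one pass over the connection
# instead of nrow(base) separate cat() calls.
output <- file("stdout", "w")
writeLines(lines, con = output)
flush(output)
```

In informal use this build-then-write pattern tends to be much faster than a row loop, because the per-call overhead of cat() is paid once rather than once per record.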

1 Answer:

Answer 0 (score: 0)

Based on my experiments with the various functions that can write to a file connection, I found this to be the fastest:

base <- data.table(apply(base, 2, FUN = as.character), stringsAsFactors = FALSE)
x <- sapply(1:nrow(base), FUN = function(row) {
  cat(base$field1[row], ",", base$field2[row], ",", base$field3[row], ",",
      base$field4[row], ",", base$field5[row], ",", base$field6[row], "\n", sep = '')
})
rm(x)

Here x just captures the NULLs that sapply returns, and the as.character conversion prevents cat from mangling factors (it would print the internal integer codes instead of the actual values).
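That factor caveat is easy to see in isolation: cat() ignores a factor's levels attribute and prints its internal integer codes, which is exactly why the as.character conversion above is needed. A tiny illustration (toy vector, not from the question):

```r
f <- factor(c("low", "high", "low"))   # levels sort alphabetically: "high", "low"

cat(f, "\n")                # prints the internal codes: 2 1 2
cat(as.character(f), "\n")  # prints the labels: low high low
```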