I am trying to write some data to HDFS using a custom R map-reduce job. The read phase is fast enough, but the post-processing write takes a very long time. I have tried (functions that can write to a file connection):
output <- file("stdout", "w")
write.table(base,file=output,sep=",",row.names=F)
writeLines(t(as.matrix(base)), con = output, sep = ",", useBytes = FALSE)
But write.table only writes part of the data (the first few rows and the last few), and writeLines does not work. So now I am trying:
for (row in 1:nrow(base)) {
  cat(base[row, ]$field1, ",", base[row, ]$field2, ",", base[row, ]$field3, ",", base[row, ]$field4, ",",
      base[row, ]$field5, ",", base[row, ]$field6, "\n", sep = '')
}
But this writes very slowly. Here are some logs showing just how slow the writing is:
2016-07-07 08:59:30,557 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/406056
2016-07-07 08:59:40,567 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/406422
2016-07-07 08:59:50,582 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/406710
2016-07-07 09:00:00,947 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/407001
2016-07-07 09:00:11,392 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/407316
2016-07-07 09:00:21,832 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/407683
2016-07-07 09:00:31,883 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/408103
2016-07-07 09:00:41,892 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/408536
2016-07-07 09:00:51,895 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/408969
2016-07-07 09:01:01,903 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/409377
2016-07-07 09:01:12,187 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/409782
2016-07-07 09:01:22,198 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/410161
2016-07-07 09:01:32,293 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/410569
2016-07-07 09:01:42,509 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/410989
2016-07-07 09:01:52,515 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/411435
2016-07-07 09:02:02,525 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/411814
2016-07-07 09:02:12,625 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/412196
2016-07-07 09:02:22,988 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/412616
2016-07-07 09:02:32,991 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/413078
2016-07-07 09:02:43,104 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/413508
2016-07-07 09:02:53,115 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/413975
2016-07-07 09:03:03,122 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/414415
2016-07-07 09:03:13,128 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/414835
2016-07-07 09:03:23,131 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/415210
2016-07-07 09:03:33,143 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/415643
2016-07-07 09:03:43,453 INFO [Thread-49] org.apache.hadoop.streaming.PipeMapRed: Records R/W=921203/416031
So I would like to know what I am doing wrong. I am using data.table.
Answer 0 (score: 0)
Based on my experiments with the various functions that can write to a file connection, I found the following to be the fastest:
base <- data.table(apply(base, 2, FUN = as.character), stringsAsFactors = FALSE)
x <- sapply(1:nrow(base),
            FUN = function(row) {
              cat(base$field1[row], ",", base$field2[row], ",", base$field3[row], ",",
                  base$field4[row], ",", base$field5[row], ",", base$field6[row], "\n", sep = '')
            })
rm(x)
Here x merely captures the NULLs returned by sapply (cat itself returns NULL), and rm(x) discards them. The as.character conversion is there to keep cat from mangling factor columns (it would otherwise print the internal factor codes instead of the actual values).
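A fully vectorized alternative may be faster still (this is a sketch, not part of the original answer; the field names match the question, but the sample data below is hypothetical): build every output line with a single paste() call and emit them with one writeLines() call, instead of calling cat() once per row.

```r
# Hypothetical sample data with the question's column names.
base <- data.frame(field1 = 1:3, field2 = c("a", "b", "c"),
                   field3 = 4:6, field4 = 7:9,
                   field5 = 10:12, field6 = 13:15,
                   stringsAsFactors = FALSE)

# do.call(paste, ...) pastes the columns element-wise, producing one
# comma-separated string per row; the loop runs in C, not in R.
lines <- do.call(paste, c(base, sep = ","))

# A single buffered write to stdout instead of nrow(base) cat() calls.
writeLines(lines)
```

For writing to actual files rather than a streaming stdout, data.table's fwrite (available in newer data.table releases) is usually the fastest option.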