I wrote this function to import files from HDFS into RStudio, and it works correctly. The problem is that it takes a very long time to return the result.
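library(data.table)
import_file <- function (file_Path)
{
  # Read every line of the HDFS file(s) into a one-column data.table (column V1)
  data.fichier <- as.data.table(system(paste("hadoop fs -cat", file_Path), intern=TRUE))
  # Split each line on commas and bind the pieces row by row into a character matrix
  return(do.call(rbind, stringr::str_split(data.fichier$V1, ',')))
}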
Its input is the HDFS directory that contains the files:
/hdfs/data/lll/l111/l11/l1/InterfacePublique-Controle-PUB_1EEUC-201803-PR-20181004-100228-indicateurs-PUB_1EEUC/*
Here is a sample of the output:
[,1] [,2] [,3] [,4] [,5]
[1,] "DIS_CD_SI_CD_QUL_SGN_PSE" "001_COE" "" "819832" "3.2664467021013293"
[2,] "DIS_CD_SI_CD_QUL_SGN_PSE" "001_COT" "" "937680" "3.7359870603079344"
[3,] "DIS_CD_SI_CD_QUL_SGN_PSE" "001_EMP" "" "3797954" "15.132142095005504"
[4,] "DIS_CD_SI_CD_QUL_SGN_PSE" "001_SOU" "" "1327439" "5.288899120540168"
[5,] "DIS_CD_SI_CD_QUL_SGN_PSE" "001_TIT" "" "13849361" "55.17984119265992"
[6,] "DIS_CD_SI_CD_QUL_SGN_PSE" "002_COE" "" "33716" "0.13433425019766052"
[7,] "DIS_CD_SI_CD_QUL_SGN_PSE" "002_COT" "" "31649" "0.1260987271475192"
[8,] "DIS_CD_SI_CD_QUL_SGN_PSE" "002_EMP" "" "158625" "0.632007665132397"
Could someone more experienced help me optimize this code?
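
One direction that might be worth trying, as a rough sketch rather than a tested fix: let `fread` parse the output of the command directly instead of collecting every line with `system(..., intern=TRUE)` and splitting it afterwards. This assumes a data.table version whose `fread` accepts the `cmd` argument, and that every line has the same number of comma-separated fields. The name `import_file_fast` and the `cat_cmd` parameter are hypothetical, introduced only for illustration.

```r
library(data.table)

# Hypothetical variant of import_file (not the original function).
# cat_cmd is an assumed parameter so the reader command can be swapped out.
import_file_fast <- function(file_path, cat_cmd = "hadoop fs -cat") {
  dt <- fread(
    cmd = paste(cat_cmd, file_path),  # fread runs the command and parses its output
    sep = ",",
    header = FALSE,
    colClasses = "character"          # keep every field as a string, like the matrix above
  )
  unname(as.matrix(dt))               # drop this conversion if a data.table is acceptable
}
```

Called as `import_file_fast("/hdfs/data/.../*")`, this should return a character matrix of the same shape as the sample output. Whether it is actually faster on this data would need to be measured.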