Need to make an import function faster

Time: 2019-01-02 12:43:24

Tags: r hdfs hadoop2

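I am writing this function to import files from HDFS into RStudio, and it works correctly. The problem is that it takes a very long time to return the result.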

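library(data.table)

import_file <- function(file_Path)
{
  # Stream the HDFS file(s) through `hadoop fs -cat`, capturing one string per line
  data.fichier <- as.data.table(system(paste("hadoop fs -cat", file_Path), intern = TRUE))
  # Split every line on commas and bind the pieces into a character matrix
  return(do.call(rbind, stringr::str_split(data.fichier$V1, ',')))
}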
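
One possible direction, offered as a sketch rather than a measured fix: the function above first captures every line as a string and then splits and row-binds the lines one by one in R, which is likely where much of the time goes. `data.table::fread` can instead read and parse the output of a shell command directly through its `cmd` argument. The name `import_file_fast` and the `cat_cmd` parameter are hypothetical and only for illustration; the sketch assumes data.table 1.11.0 or later, that `hadoop` is on the PATH, and that the files are comma-separated with no header row.

```r
library(data.table)

# Hypothetical variant of import_file (not from the original post).
# cat_cmd is an illustrative parameter: it defaults to the HDFS reader used
# in the question, and can be swapped for plain "cat" to try the function
# on a local file.
import_file_fast <- function(file_path, cat_cmd = "hadoop fs -cat") {
  # Quote the path so the local shell does not expand a trailing "*";
  # `hadoop fs` resolves glob patterns against HDFS itself.
  cmd <- paste(cat_cmd, shQuote(file_path))
  # fread runs the command and parses its output into a data.table,
  # guessing a type for each column.
  fread(cmd = cmd, sep = ",", header = FALSE)
}
```

Unlike the original function, this returns a data.table with typed columns (V1, V2, ...) rather than a character matrix, so `as.matrix()` or `colClasses = "character"` would be needed if the exact original output format matters.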

In practice, its input is the HDFS directory containing the files:

/hdfs/data/lll/l111/l11/l1/InterfacePublique-Controle-PUB_1EEUC-201803-PR-20181004-100228-indicateurs-PUB_1EEUC/*

Here is a sample of the output:

  [,1]                                [,2]                      [,3] [,4]       [,5]                   
   [1,] "DIS_CD_SI_CD_QUL_SGN_PSE"          "001_COE"                 ""   "819832"   "3.2664467021013293"   
   [2,] "DIS_CD_SI_CD_QUL_SGN_PSE"          "001_COT"                 ""   "937680"   "3.7359870603079344"   
   [3,] "DIS_CD_SI_CD_QUL_SGN_PSE"          "001_EMP"                 ""   "3797954"  "15.132142095005504"   
   [4,] "DIS_CD_SI_CD_QUL_SGN_PSE"          "001_SOU"                 ""   "1327439"  "5.288899120540168"    
   [5,] "DIS_CD_SI_CD_QUL_SGN_PSE"          "001_TIT"                 ""   "13849361" "55.17984119265992"    
   [6,] "DIS_CD_SI_CD_QUL_SGN_PSE"          "002_COE"                 ""   "33716"    "0.13433425019766052"  
   [7,] "DIS_CD_SI_CD_QUL_SGN_PSE"          "002_COT"                 ""   "31649"    "0.1260987271475192"   
   [8,] "DIS_CD_SI_CD_QUL_SGN_PSE"          "002_EMP"                 ""   "158625"   "0.632007665132397"    

Could someone more experienced help me optimize this code?

0 answers:

No answers yet