Error when moving files from local to HDFS using Scala

Time: 2017-12-20 07:35:02

Tags: scala hadoop

I have a Scala list, fileNames, which holds the names of files in a local directory. For example:

fileNames(2)
res0: String = file:///tmp/audits/xx_user.log

I am trying to move the files in fileNames from local to HDFS using Scala. To do that, I followed these steps:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path

import org.apache.commons.io.IOUtils
val hadoopconf = new Configuration()
hadoopconf.addResource(new Path("/etc/hadoop/conf/core-site.xml"))
val fs = FileSystem.get(hadoopconf) // bound to the default filesystem from core-site.xml, i.e. HDFS
val outFileStream = fs.create(new Path("hdfs://mydev/user/devusr/testfolder")) // output stream onto a new HDFS file at this path

The code works fine up to this point. When I try to add the input stream, I get an error message like this:

val inStream = fs.open(new Path(fileNames(2)))
java.lang.IllegalArgumentException: Wrong FS: file:/tmp/audits/xx_user.log, expected: hdfs://mergedev

I also tried specifying the file name directly, and the result is the same:

val inStream = fs.open(new Path("file:///tmp/audits/xx_user.log"))
java.lang.IllegalArgumentException: Wrong FS: file:/tmp/audits/xx_user.log, expected: hdfs://mergedev

But when I load the file directly into Spark, it works fine:

val localToSpark = spark.read.text(fileNames(2))
localToSpark: org.apache.spark.sql.DataFrame = [value: string]
localToSpark.collect
res1: Array[org.apache.spark.sql.Row] = Array([[Wed Dec 20 06:18:02 UTC 2017] INFO: ], [*********************************************************************************************************], [ ], [[Wed Dec 20 06:18:02 UTC 2017] INFO: Diagnostic log for xx_user.]

Can someone tell me what I am doing wrong at this point: val inStream = fs.open(new Path(fileNames(2))), which is where I get the error?

1 Answer:

Answer 0 (score: 2)

For small files, copyFromLocalFile() is sufficient:

fs.copyFromLocalFile(new Path(localFileName), new Path(hdfsFileName))
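For example, applied to the fileNames list from the question (a sketch reusing the fs and fileNames values above; the target directory is hypothetical, and new Path(...).toUri.getPath strips the file:// scheme from each source name):

// Sketch: copy every file in fileNames into one HDFS directory.
fileNames.foreach { name =>
  val localPath = new Path(new Path(name).toUri.getPath) // file:///tmp/... -> /tmp/...
  fs.copyFromLocalFile(localPath, new Path("/user/devusr/testfolder"))
}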

For large files, it is more efficient to use Apache Commons IO:

IOUtils.copyLarge(
  new FileInputStream(new File(localFileName)), fs.create(new Path(hdfsFileName)))
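Spelled out with the imports it needs and with the streams closed afterwards (still a sketch: localFileName and hdfsFileName are placeholders, and fs is the HDFS FileSystem from the question):

import java.io.{File, FileInputStream}
import org.apache.commons.io.IOUtils
import org.apache.hadoop.fs.Path

val in  = new FileInputStream(new File(localFileName)) // plain local path, no file:// prefix
val out = fs.create(new Path(hdfsFileName))            // output stream onto the HDFS file
try {
  IOUtils.copyLarge(in, out) // long-based byte count, safe for files over 2 GB
} finally {
  IOUtils.closeQuietly(in)
  IOUtils.closeQuietly(out)
}

copyLarge is the variant meant for large data: the plain IOUtils.copy returns an Int byte count and is only intended for streams under 2 GB.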

Note that the local file name should not contain the protocol (so no file:/// there).
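As for the Wrong FS error itself: FileSystem.get(hadoopconf) returns the cluster's default filesystem (hdfs://mergedev here), and that instance rejects any path with a file: scheme. If you really want an input stream over the local file rather than a copy helper, one option (a sketch reusing hadoopconf and fileNames from the question) is to let the Path pick a matching FileSystem:

val srcPath  = new Path(fileNames(2))             // file:///tmp/audits/xx_user.log
val srcFs    = srcPath.getFileSystem(hadoopconf)  // resolves to the local filesystem
val inStream = srcFs.open(srcPath)                // no Wrong FS error here

This per-path resolution is also why spark.read.text(fileNames(2)) worked: Spark derives the filesystem from each path's own scheme rather than from the configured default.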