Question

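My environment uses Spark, Pig, and Hive.

I'm having trouble writing code in Scala (or any other language compatible with my environment) that copies a file from the local file system to HDFS.

Does anyone have advice on how to proceed?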

Answer 1

The other answers didn't work for me, so I'm writing another one here.

Try the following Scala code:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path

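// Loads core-site.xml etc. from the classpath / HADOOP_CONF_DIR
val hadoopConf = new Configuration()
// Returns the file system named by fs.defaultFS (HDFS on a configured cluster)
val hdfs = FileSystem.get(hadoopConf)

// srcFilePath and destFilePath are Strings that you supply
val srcPath = new Path(srcFilePath)   // file on the local file system
val destPath = new Path(destFilePath) // target location in HDFS

hdfs.copyFromLocalFile(srcPath, destPath)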
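To make the snippet above easier to reuse, here is a minimal sketch that wraps it in a function. The name `copyToHdfs` and its parameters are my own and are not part of the Hadoop API. It uses the four-argument overload `copyFromLocalFile(delSrc, overwrite, src, dst)` so the behavior when the destination already exists is explicit.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper (not from the Hadoop API): copies a local file to the
// file system named by fs.defaultFS and reports whether the target exists afterwards.
def copyToHdfs(srcFilePath: String,
               destFilePath: String,
               conf: Configuration = new Configuration()): Boolean = {
  val hdfs = FileSystem.get(conf)
  val srcPath = new Path(srcFilePath)
  val destPath = new Path(destFilePath)
  // delSrc = false keeps the local file; overwrite = true replaces an existing target
  hdfs.copyFromLocalFile(false, true, srcPath, destPath)
  hdfs.exists(destPath)
}
```

If you want a move rather than a copy, passing `delSrc = true` deletes the local file after it has been copied. Note that when no Hadoop configuration is found, `fs.defaultFS` falls back to the local file system, so the "copy to HDFS" silently becomes a local copy.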

You should also check that the HADOOP_CONF_DIR variable is set in Spark's conf/spark-env.sh file. This ensures that Spark can find the Hadoop configuration settings.
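A quick way to confirm this from Scala is to look at the environment of the running process. The sketch below is only a sanity check; the helper name `hadoopConfDir` is made up, and it assumes the configuration directory contains a `core-site.xml` (the file where `fs.defaultFS` is normally defined).

```scala
import java.nio.file.{Files, Paths}

// Hypothetical helper: returns the Hadoop config directory when HADOOP_CONF_DIR
// is set and that directory contains a core-site.xml file, otherwise None.
def hadoopConfDir(env: Map[String, String] = sys.env): Option[String] =
  env.get("HADOOP_CONF_DIR")
    .filter(dir => Files.isRegularFile(Paths.get(dir, "core-site.xml")))

hadoopConfDir() match {
  case Some(dir) => println(s"Hadoop configuration found in $dir")
  case None      => println("HADOOP_CONF_DIR is unset or has no core-site.xml")
}
```

If this reports that nothing was found, `new Configuration()` will probably use the built-in defaults, and `FileSystem.get` will return the local file system instead of HDFS.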

Dependencies for the build.sbt file:

libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.6.0"
libraryDependencies += "org.apache.commons" % "commons-io" % "1.3.2"
libraryDependencies += "org.apache.hadoop" % "hadoop-hdfs" % "2.6.0"

Answer 2

You can write a Scala job with the Hadoop FileSystem API and use IOUtils from Apache Commons to copy the data from an InputStream to an OutputStream:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path

import org.apache.commons.io.IOUtils

val hadoopconf = new Configuration()
val fs = FileSystem.get(hadoopconf)

// Create output stream to HDFS file
val outFileStream = fs.create(new Path("hdfs://<namenode>:<port>/<filename>"))

// Create input stream from local file.
// Open it through the local file system: the default fs is usually HDFS
// and rejects file:// paths.
val localFs = FileSystem.getLocal(hadoopconf)
val inStream = localFs.open(new Path("file://<input_file>"))

IOUtils.copy(inStream, outFileStream)

// Close both files
inStream.close()
outFileStream.close()
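One weakness of copying streams by hand is that they stay open if the copy throws. As a possible variation (a sketch, not the method from this answer), Hadoop ships its own `org.apache.hadoop.io.IOUtils.copyBytes`, which can close both streams for you. The function name `streamToHdfs` and its parameters are hypothetical.

```scala
import java.io.{BufferedInputStream, FileInputStream}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.IOUtils

// Hypothetical helper: streams a local file into the file system
// identified by the scheme of `target` (e.g. hdfs://...).
def streamToHdfs(localFile: String,
                 target: String,
                 conf: Configuration = new Configuration()): Unit = {
  val targetPath = new Path(target)
  val fs = targetPath.getFileSystem(conf) // file system chosen from the path's scheme
  val in = new BufferedInputStream(new FileInputStream(localFile))
  try {
    val out = fs.create(targetPath) // overwrites an existing file by default
    // 4096-byte buffer; close = true closes both streams in a finally clause
    IOUtils.copyBytes(in, out, 4096, true)
  } finally {
    in.close() // harmless if already closed; covers a failure in fs.create
  }
}
```

Usage would look like `streamToHdfs("/path/to/local.txt", "hdfs://<namenode>:<port>/<filename>")`, with the placeholders filled in for your cluster.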

Answer 3

This works for S3 (modified from the answer above):

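import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Assumes `spark` is an existing SparkSession (as in spark-shell)
def cpToS3(localPath: String, s3Path: String): Unit = {
  // Pick the file system from the scheme of the S3 URI
  val hdfs = FileSystem.get(
               new URI(s3Path),
               spark.sparkContext.hadoopConfiguration)

  val srcPath = new Path(localPath)
  val destPath = new Path(s3Path)

  hdfs.copyFromLocalFile(srcPath, destPath)
}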

Moving a file from local to HDFS
