When I run all three of these commands in a unix shell/terminal, they all work fine and return exit status 0:
unix_shell> ls -la
unix_shell> hadoop fs -ls /user/hadoop/temp
unix_shell> s3-dist-cp --src ./abc.txt --dest s3://bucket/folder/
Now I am trying to run these same commands as external processes via the Scala process API; sample code is below:
import scala.sys.process._
val cmd_1 = "ls -la"
val cmd_2 = "hadoop fs -ls /user/hadoop/temp/"
val cmd_3 = "/usr/bin/s3-dist-cp --src /tmp/sample.txt --dest s3://bucket/folder/"
val cmd_4 = "s3-dist-cp --src /tmp/sample.txt --dest s3://bucket/folder/"
val exitCode_1 = (stringToProcess(cmd_1)).! // works fine and produces result
val exitCode_2 = (stringToProcess(cmd_2)).! // works fine and produces result
val exitCode_3 = (stringToProcess(cmd_3)).! // **it just hangs, yielding nothing**
val exitCode_4 = (stringToProcess(cmd_4)).! // **it just hangs, yielding nothing**
The only difference between cmd_3 and cmd_4 is the absolute path. Also, I am explicitly passing the relevant dependency in the spark-submit script, like this: --jars hdfs:///user/hadoop/s3-dist-cp.jar
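To make whatever the hanging command prints visible, a minimal diagnostic sketch (reusing cmd_4 from above) that captures the process's stdout and stderr through ProcessLogger would look like this:
import scala.sys.process._
val out = new StringBuilder
val err = new StringBuilder
val log = ProcessLogger(line => out.append(line).append('\n'), line => err.append(line).append('\n'))
val exitCode_5 = stringToProcess(cmd_4) ! log // same call as above, but stdout/stderr are captured
println(s"exit code: $exitCode_5\nstdout:\n$out\nstderr:\n$err")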
Your opinions/suggestions would be helpful. Thanks!
Answer 0 (score: 1)
It seems what you are doing is correct. See here: https://github.com/gorros/spark-scala-tips/blob/master/README.md
import scala.sys.process._
def s3distCp(src: String, dest: String): Unit = {
s"s3-dist-cp --src $src --dest $dest".!
}
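A small variant of the same helper (just a sketch; the name is arbitrary) that returns the exit code instead of discarding it, so the caller can check whether the copy succeeded:
import scala.sys.process._
def s3distCpWithExitCode(src: String, dest: String): Int = {
  s"s3-dist-cp --src $src --dest $dest".! // returns the process exit code (0 means success)
}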
Please check this note... I wonder if you are running into this situation.
Regarding your --jars /usr/lib/hadoop/client/*.jar
You can append the jars needed by the s3-dist-cp command like this, using tr (see my answer):
--jars $(echo /dir_of_jars/*.jar | tr ' ' ',')
Here echo expands the glob and tr replaces the spaces with commas, producing the comma-separated jar list that --jars expects.
Note: To be able to use this method, you need the Hadoop application to be added, and you need to run Spark in client or local mode, since s3-dist-cp is not available on the slave nodes. If you want to run in cluster mode, then copy the s3-dist-cp command to the slaves during bootstrap.
Answer 1 (score: 1)
It turns out the Scala process runs outside the Spark context, so in order to run the s3-dist-cp command successfully, all I had to do was stop the Spark context before launching the Scala process that contains the s3-dist-cp command. The full working code is below:
logger.info("Moving ORC files from HDFS to S3 !!")
import scala.sys.process._
logger.info("stopping spark context..##")
val spark = IngestionContext.sparkSession
spark.stop()
logger.info("spark context stopped..##")
logger.info("sleeping for 10 secs")
Thread.sleep(10000) // this sleep is not required; it was only for debugging purposes, you can remove it in your final code.
logger.info("woke up after sleeping for 10 secs")
try {
/**
* the following is the Java version; of course you need to take care of a few imports
*/
//val pb = new java.lang.ProcessBuilder("s3-dist-cp", "--src", INGESTED_ORC_DIR, "--dest", "s3:/" + paramMap(Storage_Output_Path).substring(4) + "_temp", "--srcPattern", ".*\\.orc")
//val pb = new java.lang.ProcessBuilder("hadoop", "jar", "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar", "--src", INGESTED_ORC_DIR, "--dest", "s3:/" + paramMap(Storage_Output_Path).substring(4) + "_temp", "--srcPattern", ".*\\.orc")
//pb.directory(new File("/tmp"))
//pb.inheritIO()
//pb.redirectErrorStream(true)
//val process = pb.start()
//val is = process.getInputStream()
//val isr = new InputStreamReader(is)
//val br = new BufferedReader(isr)
//var line = ""
//logger.info("printling lines:")
//while (line != null) {
// line = br.readLine()
// logger.info("line=[{}]", line)
//}
//logger.info("process goes into waiting state")
//logger.info("Waited for: " + process.waitFor())
//logger.info("Program terminated!")
/**
* the following is the Scala version
*/
val S3_DIST_CP = "s3-dist-cp"
val INGESTED_ORC_DIR = S3Util.getSaveOrcPath()
// listing out all the files
//val s3DistCpCmd = S3_DIST_CP + " --src " + INGESTED_ORC_DIR + " --dest " + paramMap(Storage_Output_Path).substring(4) + "_temp --srcPattern .*\\.orc"
//-Dmapred.child.java.opts=-Xmx1024m -Dmapreduce.job.reduces=2
val cmd = S3_DIST_CP + " --src " + INGESTED_ORC_DIR + " --dest " + "s3:/" + paramMap(Storage_Output_Path).substring(4) + "_temp --srcPattern .*\\.orc"
//val cmd = "hdfs dfs -cp -f " + INGESTED_ORC_DIR + "/* " + "s3:/" + paramMap(Storage_Output_Path).substring(4) + "_temp/"
//val cmd = "hadoop distcp " + INGESTED_ORC_DIR + "/ s3:/" + paramMap(Storage_Output_Path).substring(4) + "_temp_2/"
logger.info("full hdfs to s3 command : [{}]", cmd)
// command execution
val exitCode = (stringToProcess(cmd)).!
logger.info("s3_dist_cp command exit code: {} and s3 copy got " + (if (exitCode == 0) "SUCCEEDED" else "FAILED"), exitCode)
} catch {
case ex: Exception =>
logger.error(
"there was an exception while copying orc file to s3 bucket. {} {}",
"", ex.getMessage, ex)
throw new IngestionException("s3 dist cp command failure", null, Some(StatusEnum.S3_DIST_CP_CMD_FAILED))
}
Although the above code works exactly as expected, there were a couple of other strange findings:
Instead of this
val exitCode = (stringToProcess(cmd)).!
if you use this
val exitCode = (stringToProcess(cmd)).!!
note the difference between the single ! and the double !!: the single ! only returns the exit code, whereas the double !! returns the output of the process execution.
So with the single ! the code above works fine; with the double !! it also works, but it generates far more files and copies in the S3 bucket than the original number of files.
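A minimal sketch of that difference (the command string here is only illustrative):
import scala.sys.process._
val copyCmd = "s3-dist-cp --src /tmp/sample.txt --dest s3://bucket/folder/" // illustrative command string
val exitCode: Int  = copyCmd.!  // launches the command and returns only its exit code
val output: String = copyCmd.!! // launches the command again and returns its stdout as a String;
                                // !! throws a RuntimeException when the exit code is non-zero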
As for the spark-submit command, there is no need to worry about the --driver-class-path or even the --jars option, since I did not pass any dependency.