Question

I am trying to download a file from an FTP site with the Scala code below.

import org.apache.spark.{SparkConf, SparkContext}

object BasicTextFromFTP {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FTP Test")
    conf.setMaster("local")
    val sc = new SparkContext(conf)
    // Read the gzipped directory listing straight from the FTP URL
    val file = sc.textFile("ftp://anonymous:pandamagic@ftp.ubuntu.com/ubuntu/ls-LR.gz")
    println(file.collect().mkString("\n"))
  }
}

Running it produces the following error.

16/02/12 10:52:22 INFO SparkContext: Created broadcast 0 from textFile at BasicTextFromFTP.scala:14
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: ftp://anonymous:pandamagic@ftp.ubuntu.com/ubuntu/ls-LR.gz
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
    at ftp.BasicTextFromFTP$.main(BasicTextFromFTP.scala:15)
    at ftp.BasicTextFromFTP.main(BasicTextFromFTP.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

I am using Spark 1.6.0 with Scala 2.11.

Answer 1

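Are you able to access this FTP URL at all? I can't tell from here, and FTP is sometimes blocked on corporate networks. You may want to download this HDFS API project (https://github.com/pppsunil/HelloHDFS) and run it from the command line to see whether you can fetch the file. You can find more information about the program in this blog entry: http://wpcertification.blogspot.com/2014/07/hdfs-java-client.html. If basic access to the FTP URL does not work, then that is your problem. If it does work, the issue is probably something Spark-related.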
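If you would rather not pull in the linked project, a rough way to run the same kind of check is sketched below. Note that this swaps in a different technique: it uses only the JDK's built-in FTP URL handler, not the Hadoop API that the HelloHDFS project uses, so it only tells you whether the network and the FTP server let you through. The object name `FtpReachabilityCheck` and its helpers `describe` and `canRead` are made up for illustration.

```scala
import java.net.URI
import scala.util.{Failure, Success, Try}

// Hypothetical helper object; not part of Spark or Hadoop.
object FtpReachabilityCheck {

  // Summarise the URL for logging, leaving the password out.
  def describe(ftpUrl: String): String = {
    val uri = new URI(ftpUrl)
    val user = Option(uri.getUserInfo).map(_.takeWhile(_ != ':')).getOrElse("<none>")
    s"host=${uri.getHost} user=$user path=${uri.getPath}"
  }

  // Open the URL with the JDK's FTP handler and read one byte.
  // A Failure here points at the network or the server, not at Spark.
  def canRead(ftpUrl: String): Try[Int] = Try {
    val in = new URI(ftpUrl).toURL.openStream()
    try in.read() finally in.close()
  }

  def main(args: Array[String]): Unit = {
    // Defaults to the URL from the question when no argument is given.
    val url = args.headOption.getOrElse(
      "ftp://anonymous:pandamagic@ftp.ubuntu.com/ubuntu/ls-LR.gz")
    println(describe(url))
    canRead(url) match {
      case Success(_) => println("FTP URL is readable outside Spark")
      case Failure(e) => println(s"FTP access failed: $e")
    }
  }
}
```

If this fails, look at the network or the FTP server first. If it succeeds while `sc.textFile` still reports that the input path does not exist, the problem is more likely on the Spark/Hadoop side.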

How to use SparkContext.textFile with an FTP file?
