Unable to read a file over FTP with SparkContext.textFile(...) on Google Dataproc

Date: 2016-12-06 11:03:16

Tags: apache-spark ftp google-cloud-dataproc

I'm running a Spark cluster on Google Dataproc and I'm having trouble reading a GZipped file from FTP with sparkContext.textFile(...).

The code I'm running is:

import org.apache.spark.{SparkConf, SparkContext}

object SparkFtpTest extends App {
  // SparkContext creation is not shown in the original snippet; assumed to be roughly this
  val sc = new SparkContext(new SparkConf().setAppName("SparkFtpTest"))

  val file = "ftp://username:password@host:21/filename.txt.gz"
  val lines = sc.textFile(file)
  lines.saveAsTextFile("gs://my-bucket-storage/tmp123")
}

The error I get is:

Exception in thread "main" org.apache.commons.net.ftp.FTPConnectionClosedException: Connection closed without indication.

I saw some people suggest the credentials might be wrong, so I tried entering wrong credentials on purpose and got a different error, namely invalid login credentials.

If I copy the URL into a browser it also works fine - the file downloads correctly.

It's also worth mentioning that I tried using the Apache commons-net library directly (the same version Spark uses - 2.2) and that worked - I was able to stream the data (from both the master and the worker nodes). I wasn't able to decompress it (using Java's GZipInputStream); I don't remember how it failed, but I can try to reproduce it if you think it matters. I think this shows there is no firewall issue on the cluster, although I couldn't download the file with curl.
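
For reference, the kind of manual check described above could look roughly like this (a sketch only; host, credentials and file name are placeholders, it assumes commons-net's FTPClient plus Java's GZIPInputStream, run e.g. from spark-shell on a cluster node):

import java.util.zip.GZIPInputStream
import org.apache.commons.net.ftp.{FTP, FTPClient}
import scala.io.Source

// Connect and log in; host and credentials are placeholders, not the real ones.
val client = new FTPClient()
client.connect("host", 21)
client.login("username", "password")
client.enterLocalPassiveMode()            // passive mode is usually friendlier to firewalls/NAT
client.setFileType(FTP.BINARY_FILE_TYPE)  // binary mode so the .gz bytes are not mangled

// Stream the file and gunzip it on the fly; print a few lines as a sanity check.
val raw = client.retrieveFileStream("/filename.txt.gz")   // returns null if the transfer can't start
Source.fromInputStream(new GZIPInputStream(raw)).getLines().take(5).foreach(println)

raw.close()
client.completePendingCommand()
client.logout()
client.disconnect()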

I believe I ran the same code from my local machine a few months ago and, if I remember correctly, it worked fine.

Do you have any idea what is causing this? Could it be some kind of dependency conflict, and if so, which one?

I have a couple of dependencies in the project, e.g. google-sdk, solrj, ... However, if it were a dependency problem I would expect to see something like a ClassNotFoundException or a NoSuchMethodError.

The whole stack trace looks like this:

16/12/05 23:53:46 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://my-bucket-storage/tmp123/_temporary/
16/12/05 23:53:47 WARN com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://my-bucket-storage/tmp123/_temporary/ - removing from cache
16/12/05 23:53:49 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://my-bucket-storage/tmp123/_temporary/0/
16/12/05 23:53:50 WARN com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://my-bucket-storage/tmp123/_temporary/0/ - removing from cache
16/12/05 23:53:50 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://my-bucket-storage/tmp123/_temporary/
16/12/05 23:53:51 WARN com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://my-bucket-storage/tmp123/_temporary/ - removing from cache
Exception in thread "main" org.apache.commons.net.ftp.FTPConnectionClosedException: Connection closed without indication.
    at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:298)
    at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:495)
    at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:537)
    at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:586)
    at org.apache.commons.net.ftp.FTP.quit(FTP.java:794)
    at org.apache.commons.net.ftp.FTPClient.logout(FTPClient.java:788)
    at org.apache.hadoop.fs.ftp.FTPFileSystem.disconnect(FTPFileSystem.java:151)
    at org.apache.hadoop.fs.ftp.FTPFileSystem.getFileStatus(FTPFileSystem.java:395)
    at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1701)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1647)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:222)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1906)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1219)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1161)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1161)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1161)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1064)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1030)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1030)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1030)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:956)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:956)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:956)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:955)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1459)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1438)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1438)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
    at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1438)

1 Answer:

Answer 0 (score: 1)

看起来这可能是Spark / Hadoop中一个已知未解决的问题:https://issues.apache.org/jira/browse/HADOOP-11886https://github.com/databricks/learning-spark/issues/21都提到类似的堆栈跟踪。

If you're able to use the Apache commons-net library manually, you can achieve the same effect as sc.textFile by fetching the list of files, parallelizing that list of file names into an RDD with sc.parallelize, and then using flatMap, where each task takes a file name and reads that file line by line, producing the output collection of lines for each file.
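
A minimal sketch of that approach, under a few assumptions: readFtpGzipLines is a hypothetical helper (not part of Spark or commons-net), the FTP details and file names are placeholders, and sc is the existing SparkContext from the question:

import java.util.zip.GZIPInputStream
import org.apache.commons.net.ftp.{FTP, FTPClient}
import scala.io.Source

// Hypothetical helper: open one FTP file, gunzip it and return all of its lines.
// Everything it needs is created inside the function so the closure ships cleanly to executors.
def readFtpGzipLines(host: String, user: String, pass: String, path: String): List[String] = {
  val client = new FTPClient()
  client.connect(host, 21)
  client.login(user, pass)
  client.enterLocalPassiveMode()
  client.setFileType(FTP.BINARY_FILE_TYPE)
  val raw = client.retrieveFileStream(path)   // null if the transfer can't start
  try {
    Source.fromInputStream(new GZIPInputStream(raw)).getLines().toList
  } finally {
    raw.close()
    client.completePendingCommand()
    client.logout()
    client.disconnect()
  }
}

// The file list would come from FTPClient.listFiles(...) on the driver; placeholders here.
val fileNames = Seq("/part-0001.txt.gz", "/part-0002.txt.gz")

// Parallelize the file names and let each task read its files line by line.
val lines = sc.parallelize(fileNames, numSlices = fileNames.size)
  .flatMap(path => readFtpGzipLines("host", "username", "password", path))

lines.saveAsTextFile("gs://my-bucket-storage/tmp123")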

Alternatively, if the amount of data you have on the FTP server is small (up to maybe 10 GB or so), parallel reads won't help much compared to a single thread copying it from the FTP server into HDFS or GCS on your Dataproc cluster, and then processing it with a Spark job that reads the HDFS or GCS path.
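
A sketch of that alternative, assuming commons-net works from the driver (as the question suggests) and that the GCS connector is on the cluster's classpath (it is on Dataproc); host, credentials, bucket and file names are placeholders:

import java.net.URI
import org.apache.commons.net.ftp.{FTP, FTPClient}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

// Single-threaded copy FTP -> GCS, done once on the driver (or in a small standalone job).
val client = new FTPClient()
client.connect("host", 21)
client.login("username", "password")
client.enterLocalPassiveMode()
client.setFileType(FTP.BINARY_FILE_TYPE)
val in = client.retrieveFileStream("/filename.txt.gz")

val conf = new Configuration()
val gcs = FileSystem.get(new URI("gs://my-bucket-storage/"), conf)
val out = gcs.create(new Path("gs://my-bucket-storage/input/filename.txt.gz"))

IOUtils.copyBytes(in, out, 4096, true)   // streams the bytes and closes both streams
client.completePendingCommand()
client.logout()
client.disconnect()

// Then process it via the well-supported GCS code path instead of ftp://.
val lines = sc.textFile("gs://my-bucket-storage/input/filename.txt.gz")
lines.saveAsTextFile("gs://my-bucket-storage/tmp123")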