Cannot build spark-tensorflow-connector because a file already exists

Asked: 2019-01-27 14:45:47

Tags: apache-spark tensorflow apache-spark-sql tensorflow-datasets

I am running into a problem building spark-tensorflow-connector on GCP Dataproc.

The build fails because one of the tests fails with:

java.lang.IllegalStateException: LocalPath /tmp/spark-connector-propagate7442350445858279141 already exists. SaveMode: ErrorIfExists

I think the problem is related to this part of LocalWriteSuite.scala:

"Propagate" should {
   "write data locally" in {
     // Create a dataframe with 2 partitions
     val rdd = spark.sparkContext.parallelize(testRows, numSlices = 2)
     val df = spark.createDataFrame(rdd, schema)

     // Write the partitions onto the local hard drive. Since it is going to be the
     // local file system, the partitions will be written in the same directory of the
     // same machine.
     // In a distributed setting though, two different machines would each hold a single
     // partition.
     val localPath = Files.createTempDirectory("spark-connector-propagate").toAbsolutePath.toString
     // Delete the directory, the default mode is ErrorIfExists
     Files.delete(Paths.get(localPath))
     df.write.format("tfrecords")
       .option("recordType", "Example")
       .option("writeLocality", "local")
       .save(localPath)

     // Read again this directory, this time using the Hadoop file readers, it should
     // return the same data.
     // This only works in this test and does not hold in general, because the partitions
     // will be written on the workers. Everything runs locally for tests.
     val df2 = spark.read.format("tfrecords").option("recordType", "Example")
       .load(localPath).sort("id").select("id", "IntegerTypeLabel", "LongTypeLabel",
       "FloatTypeLabel", "DoubleTypeLabel", "VectorLabel", "name") // Correct column order.

     assert(df2.collect().toSeq === testRows.toSeq)
   }
 }
}

If I understand correctly, the dataset has two partitions, and it looks like both are being written locally to the same path.
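For reference, the wording of the exception points at the ErrorIfExists save mode, which refuses to write to a target path that already exists. The following is only a minimal sketch of that behavior using java.nio.file. The names `ErrorIfExistsSketch` and `saveErrorIfExists` are hypothetical and are not the connector's code, and the sketch makes no claim about how the connector actually writes its partitions.

```scala
import java.nio.file.{Files, Path}
import scala.util.Try

object ErrorIfExistsSketch {
  // Hypothetical stand-in for a writer in ErrorIfExists mode:
  // it refuses to write when the target path is already present.
  def saveErrorIfExists(target: Path): Unit = {
    if (Files.exists(target)) {
      throw new IllegalStateException(
        s"LocalPath $target already exists. SaveMode: ErrorIfExists")
    }
    Files.createDirectory(target)
  }

  def main(args: Array[String]): Unit = {
    // Same pattern as the test: reserve a unique name, then delete the directory
    // so that the path does not exist when the write starts.
    val localPath = Files.createTempDirectory("spark-connector-propagate").toAbsolutePath
    Files.delete(localPath)

    // The first write creates the path and succeeds.
    saveErrorIfExists(localPath)

    // A second write to the same path is rejected.
    val second = Try(saveErrorIfExists(localPath))
    println(s"second write failed: ${second.isFailure}")

    // Clean up the directory created by the first write.
    Files.delete(localPath)
  }
}
```

In this sketch, anything that creates the path between the `Files.delete` call and the write is enough to trigger the exception. Whether that is what happens on Dataproc is not established here.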

Has anyone run into this problem, or am I missing a step?

Note that I have also posted a similar question on GitHub.

1 Answer:

Answer 0 (score: 0)

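I suspected I had missed a step, because this is a very valuable package and many people have installed spark-tensorflow-connector successfully:

I had not built TensorFlow Hadoop as a Maven dependency, which step 3 explicitly requires.

However, when building TensorFlow Hadoop I had to run one additional command: export _JAVA_OPTIONS=-Djdk.net.URLClassPath.disableClassPathURLCheck=true, as suggested by Michael in "Maven surefire could not find ForkedBooter class".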

Edit: the problem still persists on Dataproc.

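Solution:

After some research, I downloaded the latest version of spark-tensorflow-connector directly and installed it following the instructions published on Maven. I did not have to install TensorFlow Hadoop as suggested in TensorFlow Ecosystem. Note that I was able to install the jar file on my Dataproc cluster.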