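I am able to write the results of Structured Streaming to Parquet files. The problem is that those files end up on the local file system, and I now want to write them to the Hadoop file system (HDFS). Is there a way to do that?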
StreamingQuery query = result //.orderBy("window")
.repartition(1)
.writeStream()
.outputMode(OutputMode.Append())
.format("parquet")
.option("checkpointLocation", "hdfs://localhost:19000/data/checkpoints")
.start("hdfs://localhost:19000/data/total");
I used this code, but it fails with:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:19000/data/checkpoints/metadata, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
at org.apache.spark.sql.execution.streaming.StreamMetadata$.read(StreamMetadata.scala:51)
at org.apache.spark.sql.execution.streaming.StreamExecution.<init>(StreamExecution.scala:100)
at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:232)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:206)
Thanks.
Answer 0: (score: 1)
This is a known issue: https://issues.apache.org/jira/browse/SPARK-19407
It should be fixed in the next release. As a workaround, you can set the default file system to HDFS with --conf spark.hadoop.fs.defaultFS=hdfs://localhost:19000
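To see why changing the default file system helps, note that the stack trace shows RawLocalFileSystem being handed an hdfs:// path: the checkpoint path was checked against the default file system (file:///) instead of the file system named by its own scheme. The sketch below is a simplified, stand-alone illustration of that kind of scheme check using only java.net.URI. It is not Hadoop's actual implementation, and the class and method names (WrongFsSketch, schemeMatches) are hypothetical.

```java
import java.net.URI;

public class WrongFsSketch {
    // Hypothetical, simplified stand-in for a file system's path check.
    // Not Hadoop's real code: it only compares URI schemes.
    static boolean schemeMatches(URI fsUri, URI path) {
        String pathScheme = path.getScheme();
        // A path without a scheme is resolved against the file system itself,
        // so it is always accepted.
        if (pathScheme == null) {
            return true;
        }
        return pathScheme.equalsIgnoreCase(fsUri.getScheme());
    }

    public static void main(String[] args) {
        URI checkpoint = URI.create("hdfs://localhost:19000/data/checkpoints/metadata");
        URI localDefault = URI.create("file:///");
        URI hdfsDefault = URI.create("hdfs://localhost:19000");

        // Default file system is local: the hdfs:// path is rejected ("Wrong FS").
        System.out.println(schemeMatches(localDefault, checkpoint));
        // Default file system is HDFS: the same path is accepted.
        System.out.println(schemeMatches(hdfsDefault, checkpoint));
    }
}
```

With the default file system pointed at hdfs://localhost:19000, the checkpoint path and the file system agree on the scheme, which is the effect the --conf workaround above is meant to achieve.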
Answer 1: (score: 0)
This works for me, so a Spark upgrade has probably fixed the issue:
option("checkpointLocation", "hdfs:///project/dz/collab/stream/hdfs/chk_ucra").trigger(Trigger.ProcessingTime("300 seconds")).start("/project/dz/collab/stream/hdfs/ucra")