具有结构化流的S3检查点

时间:2017-07-07 14:30:28

标签: java apache-spark amazon-s3 spark-structured-streaming checkpointing

我已尝试过Apache Spark (Structured Streaming) : S3 Checkpoint support

中提供的建议

我仍然面临这个问题。以下是我得到的错误

17/07/06 17:04:56 WARN FileSystem: "s3n" is a deprecated filesystem 
name. Use "hdfs://s3n/" instead.
Exception in thread "main" java.lang.IllegalArgumentException: 
java.net.UnknownHostException: s3n

我的代码中有这样的东西

SparkSession spark = SparkSession
    .builder()
    .master("local[*]")
    .config("spark.hadoop.fs.defaultFS","s3")
    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    .config("spark.hadoop.fs.s3n.awsAccessKeyId","<my-key>")
    .config("spark.hadoop.fs.s3n.awsSecretAccessKey","<my-secret-key>")
    .appName("My Spark App")
    .getOrCreate();

然后像这样使用checkpoint目录:

StreamingQuery line = topicValue.writeStream()
   .option("checkpointLocation","s3n://<my-bucket>/checkpointLocation/")

感谢任何帮助。提前谢谢!

1 个答案:

答案 0 :(得分:3)

要在结构化流媒体中对S3进行检查点支持,您可以尝试以下方式:

SparkSession spark = SparkSession
    .builder()
    .master("local[*]")
    .appName("My Spark App")
    .getOrCreate();

spark.sparkContext.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<my-key>")
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<my-secret-key>")

然后checkpoint目录可以是这样的:

StreamingQuery line = topicValue.writeStream()
   .option("checkpointLocation","s3n://<my-bucket>/checkpointLocation/")

我希望这有帮助!