I have the following code snippet, meant to demonstrate writing data to files in streaming mode. But after the application starts running and I check the output directory, I see that no data has been written and no _SUCCESS file exists.
Each partition (e.g. p_day=2021-04-30\p_hour=18\p_min=52) contains about 58 files (each one empty; it looks like a new file is created every second), with names like .part-01c10c8a-5ffa-4ffc-91d0-57ed68d85c93-0-0.inprogress.1cc628a9-0f88-46ca-9d14-7f10db332184.
If I change 'format' = 'parquet', to 'format' = 'csv', and leave the rest of the code unchanged, the application works: the data is written as csv and _SUCCESS appears in each partition.
Can someone help take a look? I have been stuck on this for hours.
Application code:
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment
import org.example.sources.{InfiniteEventSource, MyEvent}
import org.apache.flink.streaming.api.scala._
object T007_ParquetFormatFileSystemSink {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.enableCheckpointing(20 * 1000)
    env.setStateBackend(new FsStateBackend("file:///d:/flink-checkpoints"))

    val ds: DataStream[MyEvent] = env.addSource(new InfiniteEventSource(emitInterval = 5 * 1000))

    val tenv = StreamTableEnvironment.create(env)
    tenv.createTemporaryView("sourceTable", ds)
    ds.print()

    // Change 'format' = 'parquet', to 'format' = 'csv', then the application works
    val ddl =
      s"""
      create table sinkTable(
        id string,
        p_day STRING,
        p_hour STRING,
        p_min STRING
      ) partitioned by(p_day, p_hour, p_min) with (
        'connector' = 'filesystem',
        'path' = 'D:/csv-${System.currentTimeMillis()}',
        'format' = 'parquet',
        'sink.rolling-policy.check-interval' = '5 s',
        'sink.rolling-policy.rollover-interval' = '20 s',
        'sink.partition-commit.trigger'='process-time',
        'sink.partition-commit.policy.kind'='success-file',
        'sink.partition-commit.delay' = '0 s'
      )
      """.stripMargin(' ')
    tenv.executeSql(ddl)

    tenv.executeSql(
      """
      insert into sinkTable
      select id, date_format(occurrenceTime,'yyyy-MM-dd'), date_format(occurrenceTime, 'HH'), date_format(occurrenceTime, 'mm') from sourceTable
      """.stripMargin(' '))

    env.execute()
  }
}
InfiniteEventSource code:
import java.sql.Timestamp
import java.util.Date
import java.util.concurrent.atomic.AtomicBoolean
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}
case class MyEvent(id: String, occurrenceTime: Timestamp)
class InfiniteEventSource(emitInterval: Long) extends RichSourceFunction[MyEvent] {
  var id: Long = 1
  val running = new AtomicBoolean(true)

  override def run(sc: SourceFunction.SourceContext[MyEvent]): Unit = {
    while (running.get()) {
      sc.collect(MyEvent(id.toString, new Timestamp(new Date().getTime)))
      if (emitInterval > 0) {
        Thread.sleep(emitInterval)
      }
      id += 1
    }
  }

  override def cancel(): Unit = {
    running.set(false)
  }
}
Answer 0 (score: 0):
The filesystem sink requires checkpointing to be enabled, and pending files are only finalized when a checkpoint completes.
This is especially visible with Parquet, because it is a bulk format whose rolling policy is checkpoint-based: part files are rolled and committed only on checkpoints.
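As a point of reference, here is a minimal sketch of the checkpoint setting the Parquet sink hinges on (the 10-second interval is an illustrative assumption, not a value from the question): until the first checkpoint completes you will only ever see .inprogress files, so a shorter interval makes output appear sooner.
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// With 'format' = 'parquet' the sink can only finalize part files when a checkpoint
// completes, so the checkpoint interval effectively acts as the rollover interval.
env.enableCheckpointing(10 * 1000) // illustrative: checkpoint (and commit part files) every 10 s
// Regardless of the rolling-policy options in the DDL, pending Parquet files only
// become finished (and the partition committed with _SUCCESS) after a checkpoint completes.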
It could also be that you are running into this issue:
"Given that Flink sinks and UDFs in general do not differentiate between normal job termination (e.g. finite input stream) and termination due to failure, upon normal termination of a job, the last in-progress files will not be transitioned to the 'finished' state."
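If that is the case and the Flink version in use is 1.14 or newer (an assumption about the setup; check the option against the docs for the version actually used), one mitigation sketch is to request a final checkpoint after tasks finish, so the last in-progress files can still be committed when the job terminates normally. The key can be set in flink-conf.yaml or, for example, programmatically:
// Hedged sketch, Flink 1.14+ assumed: take a final checkpoint after tasks finish
// so the last in-progress files are committed when the job ends normally.
tenv.getConfig.getConfiguration.setString(
  "execution.checkpointing.checkpoints-after-tasks-finish.enabled", "true")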