如何根据传入的数据配置Dataflow动态设置BigQuery表分区?
答案 0 :(得分:1)
我采取的方法如下:
将窗口转换为表名
p.apply(PubsubIO.Read
.subscription(subscription)
.withCoder(TableRowJsonCoder.of())
)
.apply(Window.into(new TablePartitionWindowFn()) )
.apply(BigQueryIO.Write
.to(new DayPartitionFunc(dataset, table))
.withSchema(schema)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
);
根据传入数据设置窗口,可以忽略结束Instant,因为起始值用于设置分区:
public class TablePartitionWindowFn extends NonMergingWindowFn<Object, IntervalWindow> {
private IntervalWindow assignWindow(AssignContext context) {
TableRow source = (TableRow) context.element();
String dttm_str = (String) source.get("DTTM");
DateTimeFormatter formatter = DateTimeFormat.forPattern("yyyy-MM-dd").withZoneUTC();
Instant start_point = Instant.parse(dttm_str,formatter);
Instant end_point = start_point.withDurationAdded(1000, 1);
return new IntervalWindow(start_point, end_point);
};
动态设置表分区:
public class DayPartitionFunc implements SerializableFunction<BoundedWindow, String> {
String destination = "";
public DayPartitionFunc(String dataset, String table) {
this.destination = dataset + "." + table+ "$";
}
@Override
public String apply(BoundedWindow boundedWindow) {
// The cast below is safe because CalendarWindows.days(1) produces IntervalWindows.
String dayString = DateTimeFormat.forPattern("yyyyMMdd")
.withZone(DateTimeZone.UTC)
.print(((IntervalWindow) boundedWindow).start());
return destination + dayString;
}}
让我知道是否有更好的方法来实现相同的结果。