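I am reading data from a bounded source (a CSV file) in a batch pipeline and would like to assign timestamps to the elements based on data stored in a column of the CSV file. How can I do this in an Apache Beam pipeline?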
Answer 0 (score: 2)
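If each element in your batch source carries an event-based timestamp, for example click events represented as tuples {'timestamp', 'userid', 'ClickedSomething'}, you can assign that timestamp to the elements inside a DoFn in your pipeline.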
Java:
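@ProcessElement
public void process(ProcessContext c) {
  // Re-emit the element with its own event time as the timestamp.
  // Assumes getTimestamp() returns milliseconds since the epoch.
  c.outputWithTimestamp(
      c.element(),
      new Instant(c.element().getTimestamp()));
}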
Python:
'AddEventTimestamps' >> beam.Map(
    lambda elem: beam.window.TimestampedValue(elem, elem['timestamp']))
[Edit: non-lambda Python example from the Beam guide:]
import apache_beam as beam

class AddTimestampDoFn(beam.DoFn):
    def process(self, element):
        # Extract the numeric Unix seconds-since-epoch timestamp to be
        # associated with the current log entry.
        unix_timestamp = extract_timestamp_from_log_entry(element)
        # Wrap and emit the current entry and new timestamp in a
        # TimestampedValue.
        yield beam.window.TimestampedValue(element, unix_timestamp)

timestamped_items = items | 'timestamp' >> beam.ParDo(AddTimestampDoFn())
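The example above leaves extract_timestamp_from_log_entry undefined. For the CSV case in the question, it could look roughly like the sketch below. This is only an illustration under assumptions that are not in the original: each element is one raw CSV line, the timestamp is in the first column, and it is formatted as "YYYY-MM-DD HH:MM:SS" in UTC. Adjust TIMESTAMP_COLUMN and TIMESTAMP_FORMAT (both hypothetical names) to match your file.

```python
import csv
from datetime import datetime, timezone

# Hypothetical column layout: timestamp, userid, action.
TIMESTAMP_COLUMN = 0
# Assumed timestamp format; values are interpreted as UTC.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def extract_timestamp_from_log_entry(line):
    """Return the Unix seconds-since-epoch timestamp stored in one CSV line."""
    # csv.reader copes with quoted fields; next() yields the single parsed row.
    row = next(csv.reader([line]))
    parsed = datetime.strptime(row[TIMESTAMP_COLUMN].strip(), TIMESTAMP_FORMAT)
    # Attach UTC explicitly so the result does not depend on the local time zone.
    return parsed.replace(tzinfo=timezone.utc).timestamp()
```

If the file is read with beam.io.ReadFromText, each element is a single line of text, so this helper can be called directly from AddTimestampDoFn. Passing skip_header_lines=1 to ReadFromText should keep a header row from reaching the parser.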
[Edited per Anton's comment] For more information, see @