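I want to use the Apache Beam Python SDK to build an event-driven trading system for backtesting. Many of the PTransform operations in this system are of the "temporal validity window" / "temporal join" type.
For example, take the worked example about currency quotes and trades from Streaming Systems, the book on Beam. A similar example appears in an earlier paper.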
Right Table: quotes.txt (Currency Pair, Price, Event Time, Processing Time)
USD/JPY,102.0000,12:00:00,12:04:13
EUR/JPY,114.0000,12:00:30,12:06:23
EUR/JPY,119.0000,12:06:00,12:07:33
EUR/JPY,116.0000,12:03:00,12:09:07
Left Table: orders.txt (Currency Pair, Quantity, Event Time, Processing Time)
USD/JPY,1,12:03:00,12:03:44
EUR/JPY,2,12:02:00,12:05:07
EUR/JPY,5,12:05:00,12:08:00
EUR/JPY,3,12:08:00,12:09:33
USD/JPY,5,12:10:00,12:10:59
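Assume both of these tables are stand-ins for unbounded collections (e.g. 2 Kafka topics keyed by currency pair). I have absolutely no idea how to use the Apache Beam API to "left join" these two (potentially unbounded) collections to produce the following output.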
Output Table With Retractions: trades.txt (Currency Pair, Price*Quantity, Order Event Time, Retraction?, Trade Processing Time)
USD/JPY,102.0000,12:03:00,False,12:03:44
EUR/JPY,000.0000,12:02:00,False,12:05:07
EUR/JPY,000.0000,12:02:00,True,12:06:23
EUR/JPY,228.0000,12:02:00,False,12:06:23
EUR/JPY,570.0000,12:05:00,False,12:08:00
EUR/JPY,570.0000,12:05:00,True,12:09:07
EUR/JPY,580.0000,12:05:00,False,12:09:07
EUR/JPY,357.0000,12:08:00,False,12:09:33
USD/JPY,510.0000,12:10:00,False,12:10:59
"Final" Output Table Without Retractions: trades.txt (Currency Pair, Price*Quantity, Order Event Time, Retraction?, Trade Processing Time)
USD/JPY,102.0000,12:03:00,False,12:03:44
EUR/JPY,228.0000,12:02:00,False,12:06:23
EUR/JPY,580.0000,12:05:00,False,12:09:07
EUR/JPY,357.0000,12:08:00,False,12:09:33
USD/JPY,510.0000,12:10:00,False,12:10:59
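To pin down the join semantics independently of Beam, here is a minimal plain-Python sketch that reproduces the pair, value, and order-time columns of the final table from the two inputs. It ignores processing time entirely, and the names `QUOTES`, `ORDERS`, and `as_of_join` are made up for illustration; they are not part of any Beam API.

```python
import bisect
from collections import defaultdict

QUOTES = [  # (pair, price, event_time)
    ("USD/JPY", 102.0, "12:00:00"),
    ("EUR/JPY", 114.0, "12:00:30"),
    ("EUR/JPY", 119.0, "12:06:00"),
    ("EUR/JPY", 116.0, "12:03:00"),
]
ORDERS = [  # (pair, quantity, event_time)
    ("USD/JPY", 1, "12:03:00"),
    ("EUR/JPY", 2, "12:02:00"),
    ("EUR/JPY", 5, "12:05:00"),
    ("EUR/JPY", 3, "12:08:00"),
    ("USD/JPY", 5, "12:10:00"),
]


def as_of_join(quotes, orders):
    """Pair each order with the latest quote at or before its event time."""
    by_pair = defaultdict(list)
    for pair, price, time in quotes:
        by_pair[pair].append((time, price))
    for series in by_pair.values():
        series.sort()  # zero-padded HH:MM:SS strings sort chronologically

    trades = []
    for pair, quantity, time in orders:
        series = by_pair[pair]
        times = [t for t, _ in series]
        i = bisect.bisect_right(times, time)  # number of quotes at or before `time`
        price = series[i - 1][1] if i else 0.0  # no quote yet: value 0
        trades.append((pair, price * quantity, time))
    return trades


for trade in as_of_join(QUOTES, ORDERS):
    print("{},{:.4f},{}".format(*trade))
```

Whatever the streaming pipeline does with retractions along the way, its last non-retracted value per order should agree with this batch result.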
How would I implement the above PTransform using windows, triggers, and CoGroupByKey?
Current code: just some boilerplate with placeholders
"""Testing Apache beam joins."""
import logging
import datetime
import decimal
import typing
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
# Schema and Transforms
class Quote(typing.NamedTuple):
    base_currency: str
    quote_currency: str
    price: decimal.Decimal
    time_ms: int


class ConvertQuote(beam.DoFn):
    def process(self, element):
        pair, price_str, time_str, _ = element.rstrip().split(",")
        base_currency, quote_currency = pair.split("/")
        price = decimal.Decimal(price_str)
        time_ms = int(self._time_ms(time_str))
        yield Quote(base_currency, quote_currency, price, time_ms)

    def _time_ms(self, time):
        # strptime without a date defaults to 1900-01-01, so measure from that
        # midnight to get milliseconds since the start of the day.
        midnight = datetime.datetime(1900, 1, 1)
        dt = datetime.datetime.strptime(time, "%H:%M:%S")
        return (dt - midnight).total_seconds() * 1000


class AddQuoteTimestamp(beam.DoFn):
    def process(self, element):
        # Beam timestamps are expressed in seconds, not milliseconds.
        yield beam.window.TimestampedValue(element, element.time_ms / 1000)


class Order(typing.NamedTuple):
    base_currency: str
    quote_currency: str
    quantity: int
    time_ms: int


class ConvertOrder(beam.DoFn):
    def process(self, element):
        pair, quantity_str, time_str, _ = element.rstrip().split(",")
        base_currency, quote_currency = pair.split("/")
        quantity = int(quantity_str)
        time_ms = int(self._time_ms(time_str))
        yield Order(base_currency, quote_currency, quantity, time_ms)

    def _time_ms(self, time):
        # Same convention as ConvertQuote: milliseconds since midnight.
        midnight = datetime.datetime(1900, 1, 1)
        dt = datetime.datetime.strptime(time, "%H:%M:%S")
        return (dt - midnight).total_seconds() * 1000


class AddOrderTimestamp(beam.DoFn):
    def process(self, element):
        # Beam timestamps are expressed in seconds, not milliseconds.
        yield beam.window.TimestampedValue(element, element.time_ms / 1000)
PAIRS = ["EUR/JPY", "USD/JPY"] # Maybe pass this in as an option?
def by_pair(item, num_pairs):
    return PAIRS.index(f"{item.base_currency}/{item.quote_currency}")
# Administrative
LOGGING_MSG_FMT = "%(asctime)s - %(levelname)s: %(message)s"
LOGGING_DATE_FMT = "%Y-%m-%d %H:%M:%S%z"
logging.basicConfig(format=LOGGING_MSG_FMT, datefmt=LOGGING_DATE_FMT, level=logging.INFO)
class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument("--quotes-file", dest="quotes_file", default="quotes.txt")
        parser.add_argument("--orders-file", dest="orders_file", default="orders.txt")
        parser.add_argument("--trades-file", dest="trades_file", default="trades")
options = PipelineOptions()
my_options = options.view_as(MyOptions)
# Main
with beam.Pipeline(options=my_options) as p:
    eurjpy_quotes, usdjpy_quotes = (
        p
        | "ReadQuotes" >> beam.io.ReadFromText(my_options.quotes_file)
        | "ConvertQuotes" >> beam.ParDo(ConvertQuote())
        | "AddQuoteTimestamps" >> beam.ParDo(AddQuoteTimestamp())
        | "PartitionQuotes" >> beam.Partition(by_pair, len(PAIRS))
        # Some kind of windowing/triggering?
    )
    eurjpy_orders, usdjpy_orders = (
        p
        | "ReadOrders" >> beam.io.ReadFromText(my_options.orders_file)
        | "ConvertOrders" >> beam.ParDo(ConvertOrder())
        | "AddOrderTimestamps" >> beam.ParDo(AddOrderTimestamp())
        | "PartitionOrders" >> beam.Partition(by_pair, len(PAIRS))
        # Some kind of windowing/triggering?
    )
    # Something here using CoGroupByKey on eurjpy_quotes and eurjpy_orders
    # This is just a placeholder for now.
    eurjpy_quotes | "WriteEURJPYTrades" >> beam.io.WriteToText(my_options.trades_file)
Answer 0 (score: 1)
For processing time series, it is usually best to use the State and Timers API.
Original Blog on State and Timers
State and Timers Documentation
There is also some current work in progress in Java on temporal joins: temporal example