Streaming multiple data sources into Kafka in real time

Date: 2017-03-17 11:21:10

Tags: streaming apache-kafka monitoring

We are planning to build a real-time monitoring system with Apache Kafka. The overall idea is to push data from multiple data sources into Kafka and perform data quality checks on it. I have a few questions about this architecture:

  1. What is the best way to stream data into Apache Kafka from multiple sources, mainly Java applications, Oracle databases, REST APIs, and log files? Note that every customer deployment contains each of these data sources, so the number of sources pushing data into Kafka will be roughly the number of customers * x, where x is the number of source types listed above. Ideally we want a push approach rather than a pull approach: with a pull approach, the target system would have to be configured with credentials for every source system, which is impractical. (A push-based producer sketch follows this list.)
  2. How do we handle failures?
  3. How do we run data quality checks on the messages we receive? For example, if a message is missing required attributes, it could be discarded and an alert raised for the maintenance team to investigate. (A validation sketch also follows below.)
  4. Please share your expert opinion. Thanks!
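
For the push side (question 1), here is a minimal sketch of what each source could do with the plain Java Kafka producer client. The broker address, topic name ("monitoring-events"), key, and JSON payload are hypothetical placeholders, not values from the question:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class MetricsPusher {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            // Relevant to question 2: acks=all plus client-side retries lets the
            // producer ride out transient broker failures before giving up.
            props.put("acks", "all");
            props.put("retries", 5);

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                String payload = "{\"source\":\"app-42\",\"metric\":\"latency_ms\",\"value\":12}";
                producer.send(new ProducerRecord<>("monitoring-events", "app-42", payload),
                        (metadata, exception) -> {
                            if (exception != null) {
                                // Delivery failed after all retries: log and alert.
                                System.err.println("Send failed: " + exception.getMessage());
                            }
                        });
            }
        }
    }

Because the producer only needs the broker addresses, each customer-side source can push without the central system holding any source credentials, which is the property the question asks for.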
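For the quality checks in question 3, one common pattern is a validating consumer that drops (or routes to a dead-letter topic) malformed records and raises an alert. This sketch uses the same assumed topic name; hasRequiredAttributes is a hypothetical placeholder for real schema validation, and poll(Duration) assumes a 2.x-or-later Kafka client:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class QualityChecker {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");
            props.put("group.id", "quality-check");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("monitoring-events"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        if (!hasRequiredAttributes(record.value())) {
                            // Malformed message: alert the maintenance team and skip it.
                            System.err.println("Invalid message at offset " + record.offset());
                            continue;
                        }
                        // ... forward valid records downstream ...
                    }
                }
            }
        }

        private static boolean hasRequiredAttributes(String json) {
            // Placeholder check; a real implementation would parse the JSON
            // and validate it against a schema.
            return json != null && json.contains("\"source\"") && json.contains("\"metric\"");
        }
    }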

1 Answer:

Answer 0 (score: 1):

I think the best approach here is to use Kafka Connect: link. But note that it is a pull-based approach:

"Kafka Connect sources are pull-based for a few reasons. First, although connectors should generally run continuously, making them pull-based means that the connector/Kafka Connect decides when data is actually pulled, which allows for things like pausing connectors without losing data, brief periods of unavailability as connectors are moved, etc. Second, in distributed mode the tasks that pull data may need to be rebalanced across workers, which means they won't have a consistent location or address. While in standalone mode you could guarantee a fixed network endpoint to work with (and point other services at), this doesn't work in distributed mode where tasks can be moving around between workers." (Ewen)
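
For illustration, a minimal standalone connector configuration for the FileStreamSource connector that ships with Kafka, which tails a file into a topic; the file path and topic name below are assumptions for this sketch, not values from the question:

    # Connector config passed to connect-standalone.sh alongside the worker properties.
    # The file path and topic name are hypothetical.
    name=log-file-source
    connector.class=FileStreamSource
    tasks.max=1
    file=/var/log/app/application.log
    topic=monitoring-events

You would run it with bin/connect-standalone.sh config/connect-standalone.properties log-file-source.properties. For the Oracle sources, a JDBC source connector (for example Confluent's kafka-connect-jdbc) plays the same role, configured with the database connection details instead of a file path.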