Filtering log files in Flume using interceptors

Time: 2013-07-22 06:50:45

Tags: hadoop flume

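I have an HTTP server that writes log files, which I then load into HDFS using Flume. First, I want to filter the data based on what is in the event header or body. I read that I can do this with an interceptor and a regular expression. Can someone explain what I need to do? Do I need to write Java code that overrides Flume's code?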

I would also like to send the data to different sinks depending on a header (i.e. source = 1 goes to sink1 and source = 2 goes to sink2). How is this done?

Thanks,

Shimon

2 answers:

Answer 0: (score: 11)

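You don't need to write Java code to filter events. Use the Regex Filtering Interceptor to filter events whose body matches a regular expression. With excludeEvents = true, as below, matching events are dropped; with false, only matching events are kept: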

agent.sources.logs_source.interceptors = regex_filter_interceptor
agent.sources.logs_source.interceptors.regex_filter_interceptor.type = regex_filter
agent.sources.logs_source.interceptors.regex_filter_interceptor.regex = <your regex>
agent.sources.logs_source.interceptors.regex_filter_interceptor.excludeEvents = true
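
For context, here is a minimal sketch of a complete agent built around that interceptor. The source type, the channel and sink names, the spool directory, the HDFS path and the regex are all assumptions made for illustration; none of them come from the question.

```properties
# Hypothetical component names and paths; adjust to your own setup.
agent.sources = logs_source
agent.channels = mem_channel
agent.sinks = hdfs_sink

# Pick up finished log files from a spool directory (assumed path).
agent.sources.logs_source.type = spooldir
agent.sources.logs_source.spoolDir = /var/log/httpd/spool
agent.sources.logs_source.channels = mem_channel

# Drop every event whose body looks like a health-check request (example regex).
agent.sources.logs_source.interceptors = regex_filter_interceptor
agent.sources.logs_source.interceptors.regex_filter_interceptor.type = regex_filter
agent.sources.logs_source.interceptors.regex_filter_interceptor.regex = .*GET /health.*
agent.sources.logs_source.interceptors.regex_filter_interceptor.excludeEvents = true

agent.channels.mem_channel.type = memory

# Write the surviving events to HDFS (assumed namenode and directory).
agent.sinks.hdfs_sink.type = hdfs
agent.sinks.hdfs_sink.hdfs.path = hdfs://namenode/flume/httpd
agent.sinks.hdfs_sink.hdfs.fileType = DataStream
agent.sinks.hdfs_sink.channel = mem_channel
```

Because the interceptor is attached to the source, filtered events never reach the channel, so they cost no channel capacity.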

To route events based on headers, use the Multiplexing Channel Selector:

a1.sources = r1
a1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2 c3
a1.sources.r1.selector.default = c4

Here events with the header "state" = "CZ" go to channel "c1", events with "state" = "US" go to both "c2" and "c3", and all others go to "c4".
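
Applied to the question's case (source = 1 goes to sink1, source = 2 goes to sink2), the same idea could be sketched as follows. This assumes each event already carries a header named source; the channel types, the HDFS paths and the choice of c1 as the fallback are illustrative assumptions, and the source's own type and settings are left out.

```properties
a1.sources = r1
a1.channels = c1 c2
a1.sinks = sink1 sink2

# The source must be attached to every channel the selector can pick.
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = source
a1.sources.r1.selector.mapping.1 = c1
a1.sources.r1.selector.mapping.2 = c2
# Assumed fallback for events with any other value or no such header.
a1.sources.r1.selector.default = c1

a1.channels.c1.type = memory
a1.channels.c2.type = memory

# Each sink drains exactly one channel, so the channel choice decides the sink.
a1.sinks.sink1.type = hdfs
a1.sinks.sink1.hdfs.path = hdfs://namenode/flume/source1
a1.sinks.sink1.channel = c1
a1.sinks.sink2.type = hdfs
a1.sinks.sink2.hdfs.path = hdfs://namenode/flume/source2
a1.sinks.sink2.channel = c2
```

The selector only chooses channels; the mapping to sinks comes from which channel each sink reads.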

You can also filter events by header this way: route the header values you want to discard to a channel that is drained by a Null Sink.
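
A rough sketch of that discard path, extending the selector example above. The header value XX and the names c_drop and k_drop are hypothetical.

```properties
# Add a dedicated channel for unwanted events and attach the source to it.
a1.channels = c1 c2 c3 c4 c_drop
a1.sinks = k_drop
a1.sources.r1.channels = c1 c2 c3 c4 c_drop

# Events whose "state" header equals XX (example value) go to the drop channel.
a1.sources.r1.selector.mapping.XX = c_drop

a1.channels.c_drop.type = memory

# The Null Sink discards every event it takes from its channel.
a1.sinks.k_drop.type = null
a1.sinks.k_drop.channel = c_drop
```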

Answer 1: (score: 0)

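You can use Flume channel selectors to route events to different destinations, or you can chain several Flume agents together to implement more complex routing. Chained agents, however, become somewhat hard to maintain, both in resource usage and in the Flume topology. You could also take a look at the flume-ng router sink, which may provide the functionality you want.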

First, add a specific field to the event header with a Flume interceptor:

a1.sources = r1 r2
a1.channels = c1 c2
a1.sources.r1.channels =  c1
a1.sources.r1.type = seq
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = datacenter
a1.sources.r1.interceptors.i1.value = NEW_YORK
a1.sources.r2.channels =  c2
a1.sources.r2.type = seq
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = datacenter
a1.sources.r2.interceptors.i2.value = BERKELEY

Then you can set up your Flume channel selector like this:

a2.sources = r2
a2.channels = c1 c2 c3 c4
a2.sources.r2.channels = c1 c2 c3 c4
a2.sources.r2.selector.type = multiplexing
a2.sources.r2.selector.header = datacenter
a2.sources.r2.selector.mapping.NEW_YORK = c1
a2.sources.r2.selector.mapping.BERKELEY = c2 c3
a2.sources.r2.selector.default = c4
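
The two configurations above describe separate agents (a1 tags the events, a2 routes them), so something has to carry events from one to the other. One possible way to connect them, sketched here as an assumption rather than as part of the original answer, is an Avro sink on a1 feeding an Avro source on a2. The sink names, the host name a2_host and port 4545 are made up for illustration.

```properties
# Agent a1: forward both tagged channels to agent a2 over Avro.
a1.sinks = k1 k2
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = a2_host
a1.sinks.k1.port = 4545
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c2
a1.sinks.k2.hostname = a2_host
a1.sinks.k2.port = 4545

# Agent a2: receive the events; their headers, including datacenter,
# arrive with them, so the multiplexing selector on r2 can route on them.
a2.sources.r2.type = avro
a2.sources.r2.bind = 0.0.0.0
a2.sources.r2.port = 4545
```

If tagging and routing can live in one agent, the static interceptor and the multiplexing selector can instead be attached to the same source, which avoids the extra hop.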

Alternatively, you can set up the avro-router sink like this:

agent.sinks.routerSink.type = com.datums.stream.AvroRouterSink
agent.sinks.routerSink.hostname = test_host
agent.sinks.routerSink.port = 34541
agent.sinks.routerSink.channel = memoryChannel

# Set sink name
agent.sinks.routerSink.component.name = AvroRouterSink

# Set header name for routing
agent.sinks.routerSink.condition = datacenter

# Set routing conditions
agent.sinks.routerSink.conditions = east,west
agent.sinks.routerSink.conditions.east.if = ^NEW_YORK
agent.sinks.routerSink.conditions.east.then.hostname = east_host
agent.sinks.routerSink.conditions.east.then.port = 34542
agent.sinks.routerSink.conditions.west.if = ^BERKELEY
agent.sinks.routerSink.conditions.west.then.hostname = west_host
agent.sinks.routerSink.conditions.west.then.port = 34543