I am using Logstash to parse nested, multi-line XML documents and forward them to Elasticsearch.
Such a document might look like this:
<?xml version="1.0" encoding="UTF-8"?>
<Root_Element xmlns:ns2="some_namespace">
  <creationTime>2016-02-05T00:27:29.752Z</creationTime>
  <provider>some_provider</provider>
  <Event>
    <eventId>111999_0</eventId>
    <something_interesting some_attribute="foo" other_attribute="bar" yet_another_attribute="whatever"/>
    <eventStartTime>2016-01-22T04:00:00Z</eventStartTime>
    <eventStopTime>2016-02-19T18:00:00Z</eventStopTime>
    <location loc_attribute="fooz" other_loc_attribute="unknown" and_one_more="hooray">
      <xy lat="51.514728" lon="-0.073563" name="some_name" direction="north"/>
    </location>
    <comment language="en">Some text comment.</comment>
    <comment language="en">Some other text comment.</comment>
  </Event>
</Root_Element>
To read such a document in Logstash, I use the following configuration file:
##########
# INPUT
##########
input {
  # listen on tcp
  tcp {
    port => 9000
    # do not split events on newlines but read multiple lines at once instead;
    # events start with <Event>, and everything that is neither <Event> nor
    # </Root_Element> belongs to the previous event
    codec => multiline {
      pattern => "(?=<Event>)|(?=</Root_Element>)"
      negate => "true"
      what => "previous"
    }
  }
}
##########
# FILTER
##########
filter {
  # parse event input as XML
  xml {
    source => "message"
    remove_namespaces => true
    store_xml => true
    target => "parsed"
  }
  # split the document into one event per Event tag
  split {
    field => "parsed[Event]"
  }
  # flatten the nested event structure onto the root level
  ruby {
    code => "
      event['parsed']['Event'].each do |key, value|
        event[key] = value[0]
      end
    "
  }
  # remove unnecessary fields from the output
  mutate {
    remove_field => ["message", "parsed", "host", "port", "tags"]
  }
}
##########
# OUTPUT
##########
output {
  # forward event to the elasticsearch host
  # elasticsearch {
  #   hosts => ["elasticsearch"]
  # }
  # write event to stdout for debugging
  stdout {
    codec => rubydebug
  }
}
To test this, simply save the XML content above to a file, start Logstash with the configuration shown, and send the XML content to Logstash via

cat filename.xml | nc <logstash_ip_or_hostname> 9000

This results in the following output in Logstash:
{
    "@timestamp" => "2016-05-03T11:51:39.777Z",
    "@version" => "1",
    "eventId" => "111999_0",
    "something_interesting" => {
        "some_attribute" => "foo",
        "other_attribute" => "bar",
        "yet_another_attribute" => "whatever"
    },
    "eventStartTime" => "2016-01-22T04:00:00Z",
    "eventStopTime" => "2016-02-19T18:00:00Z",
    "location" => {
        "loc_attribute" => "fooz",
        "other_loc_attribute" => "unknown",
        "and_one_more" => "hooray",
        "xy" => [
            [0] {
                "lat" => "51.514728",
                "lon" => "-0.073563",
                "name" => "some_name",
                "direction" => "north"
            }
        ]
    },
    "comment" => {
        "language" => "en",
        "content" => "Some text comment."
    }
}
However, this is not quite what I want: the event contains string values (e.g. eventId), objects (e.g. something_interesting), and arrays of objects (e.g. location => xy). I would rather have the final event flat instead of nested, because handling nested data in Elasticsearch and Kibana comes with some problems.
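For illustration, the transformation I have in mind could be sketched in plain Ruby roughly like this (flatten_hash is just a hypothetical helper name, and I assume the top-level value is a hash; this is not a Logstash filter yet):

```ruby
# Recursively flatten a nested structure of hashes and arrays into a single
# hash whose keys are the dot-joined paths to each leaf value.
def flatten_hash(value, prefix = nil, out = {})
  case value
  when Hash
    # descend into each key, extending the dotted prefix
    value.each { |k, v| flatten_hash(v, prefix ? "#{prefix}.#{k}" : k.to_s, out) }
  when Array
    # use the element index as a path segment
    value.each_with_index { |v, i| flatten_hash(v, "#{prefix}.#{i}", out) }
  else
    # leaf value: store it under the accumulated dotted key
    out[prefix] = value
  end
  out
end

nested = { "location" => { "loc_attribute" => "fooz", "xy" => [{ "lat" => "51.514728" }] } }
flatten_hash(nested)
# => {"location.loc_attribute"=>"fooz", "location.xy.0.lat"=>"51.514728"}
```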
Furthermore, the original XML content has two <comment> tags, but for some reason the second one does not make it into the output.
I would like the output to look like this instead:
{
    "@timestamp" => "2016-05-03T12:00:54.182Z",
    "@version" => "1",
    "eventId" => "111999_0",
    "something_interesting.some_attribute" => "foo",
    "something_interesting.other_attribute" => "bar",
    "something_interesting.yet_another_attribute" => "whatever",
    "eventStartTime" => "2016-01-22T04:00:00Z",
    "eventStopTime" => "2016-02-19T18:00:00Z",
    "location.loc_attribute" => "fooz",
    "location.other_loc_attribute" => "unknown",
    "location.and_one_more" => "hooray",
    "xy.0.lat" => "51.514728",
    "xy.0.lon" => "-0.073563",
    "xy.0.name" => "some_name",
    "xy.0.direction" => "north",
    "comment.0.language" => "en",
    "comment.0.content" => "Some text comment.",
    "comment.1.language" => "en",
    "comment.1.content" => "Some other text comment."
}
The separator in the keys does not have to be a dot; it could be anything else (I am not sure right now whether dots are even allowed).
Any suggestions on how to achieve this? Do I have to write a custom Ruby plugin for this transformation, or can it also be done with built-in plugins?
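If it has to happen inside the existing ruby filter, I imagine replacing its code block with something along these lines; this is an untested sketch against the Logstash 2.x event-as-hash API, and flatten_into is a name I made up:

```
# flatten parsed[Event] recursively into dotted top-level keys (sketch only)
ruby {
  code => "
    def flatten_into(event, value, prefix)
      if value.is_a?(Hash)
        value.each { |k, v| flatten_into(event, v, prefix ? prefix + '.' + k : k) }
      elsif value.is_a?(Array)
        value.each_with_index { |v, i| flatten_into(event, v, prefix + '.' + i.to_s) }
      else
        event[prefix] = value
      end
    end
    flatten_into(event, event['parsed']['Event'], nil)
  "
}
```

But I would prefer a built-in solution over maintaining this kind of inline Ruby, if one exists.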