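I am using Logstash and Elasticsearch to collect tweets with the Twitter plugin. My problem is that I receive a document from Twitter and I would like to do some preprocessing before indexing it. Let's say this is the document I get from Twitter: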
{
"tweet": {
"tweetId": 1025,
"tweetContent": "Hey this is a fake document for stackoverflow #stackOverflow #elasticsearch",
"hashtags": ["stackOverflow", "elasticsearch"],
"publishedAt": "2017 23 August",
"analytics": {
"likeNumber": 400,
"shareNumber": 100
}
},
"author":{
"authorId": 819744,
"authorAt": "the_expert",
"authorName": "John Smith",
"description": "Haha it's a fake description"
}
}
Now, from this one document that Twitter sends me, I would like to generate two documents. The first one will be indexed in twitter/tweet/1025:
# The id for this document should be the one from tweetId `"tweetId": 1025`
{
"content": "Hey this is a fake document for stackoverflow #stackOverflow #elasticsearch", # this field has been renamed
"hashtags": ["stackOverflow", "elasticsearch"],
"date": "2017/08/23", # the date has been formatted
"shareNumber": 100 # This field has been flattened
}
The second one will be indexed in twitter/author/819744:
# The id for this document should be the one from authorId `"authorId": 819744`
{
"authorAt": "the_expert",
"description": "Haha it's a fake description"
}
I have defined the output as follows:
output {
stdout { codec => dots }
elasticsearch {
hosts => [ "localhost:9200" ]
index => "twitter"
document_type => "tweet"
}
}
How can I process the information coming from Twitter?
EDIT
So my complete configuration file should look like this:
input {
twitter {
consumer_key => "consumer_key"
consumer_secret => "consumer_secret"
oauth_token => "access_token"
oauth_token_secret => "access_token_secret"
keywords => [ "random", "word"]
full_tweet => true
type => "tweet"
}
}
filter {
clone {
clones => ["author"]
}
if([type] == "tweet") {
mutate {
remove_field => ["authorId", "authorAt"]
}
} else {
mutate {
remove_field => ["tweetId", "tweetContent"]
}
}
}
output {
stdout { codec => dots }
if [type] == "tweet" {
elasticsearch {
hosts => [ "localhost:9200" ]
index => "twitter"
document_type => "tweet"
document_id => "%{[tweetId]}"
}
} else {
elasticsearch {
hosts => [ "localhost:9200" ]
index => "twitter"
document_type => "author"
document_id => "%{[authorId]}"
}
}
}
Answer 0 (score: 2)
You can use the clone filter plugin in Logstash.
Here is a sample Logstash configuration file that takes JSON input from stdin and shows the output on stdout:
input {
stdin {
codec => json
type => "tweet"
}
}
filter {
mutate {
add_field => {
"tweetId" => "%{[tweet][tweetId]}"
"content" => "%{[tweet][tweetContent]}"
"date" => "%{[tweet][publishedAt]}"
"shareNumber" => "%{[tweet][analytics][shareNumber]}"
"authorId" => "%{[author][authorId]}"
"authorAt" => "%{[author][authorAt]}"
"description" => "%{[author][description]}"
}
}
date {
match => ["date", "yyyy dd MMMM"]
target => "date"
}
ruby {
code => '
event.set("hashtags", event.get("[tweet][hashtags]"))
'
}
clone {
clones => ["author"]
}
mutate {
remove_field => ["author", "tweet", "message"]
}
if([type] == "tweet") {
mutate {
remove_field => ["authorId", "authorAt", "description"]
}
} else {
mutate {
remove_field => ["tweetId", "content", "hashtags", "date", "shareNumber"]
}
}
}
output {
stdout {
codec => rubydebug
}
}
Using this as input:
{"tweet": { "tweetId": 1025, "tweetContent": "Hey this is a fake document", "hashtags": ["stackOverflow", "elasticsearch"], "publishedAt": "2017 23 August","analytics": { "likeNumber": 400, "shareNumber": 100 } }, "author":{ "authorId": 819744, "authorAt": "the_expert", "authorName": "John Smith", "description": "fake description" } }
you will get these two documents:
{
"date" => 2017-08-23T00:00:00.000Z,
"hashtags" => [
[0] "stackOverflow",
[1] "elasticsearch"
],
"type" => "tweet",
"tweetId" => "1025",
"content" => "Hey this is a fake document",
"shareNumber" => "100",
"@timestamp" => 2017-08-23T20:36:53.795Z,
"@version" => "1",
"host" => "my-host"
}
{
"description" => "fake description",
"type" => "author",
"authorId" => "819744",
"@timestamp" => 2017-08-23T20:36:53.795Z,
"authorAt" => "the_expert",
"@version" => "1",
"host" => "my-host"
}
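To sanity-check the field mapping without starting Logstash, the clone-then-prune steps of the filter can be mirrored in plain Ruby. This is only a rough sketch: `split_event`, `TWEET_ONLY` and `AUTHOR_ONLY` are hypothetical names, a plain Hash stands in for the Logstash event, and the date parsing step is not reproduced.

```ruby
# Hypothetical stand-in for the filter section; a plain Hash plays the event.
TWEET_ONLY  = %w[tweetId content hashtags date shareNumber].freeze
AUTHOR_ONLY = %w[authorId authorAt description].freeze

def split_event(doc)
  # Copy the nested values to top-level fields (the add_field and ruby steps).
  # Numbers become strings, as in the rubydebug output shown above.
  flat = {
    "type"        => "tweet",
    "tweetId"     => doc.dig("tweet", "tweetId").to_s,
    "content"     => doc.dig("tweet", "tweetContent"),
    "hashtags"    => doc.dig("tweet", "hashtags"),
    "date"        => doc.dig("tweet", "publishedAt"), # left unparsed here
    "shareNumber" => doc.dig("tweet", "analytics", "shareNumber").to_s,
    "authorId"    => doc.dig("author", "authorId").to_s,
    "authorAt"    => doc.dig("author", "authorAt"),
    "description" => doc.dig("author", "description")
  }
  # The clone step: a copy of the event whose type is set to "author".
  author = flat.merge("type" => "author")
  # The conditional mutate steps: each event drops the other one's fields.
  tweet  = flat.reject { |key, _| AUTHOR_ONLY.include?(key) }
  author = author.reject { |key, _| TWEET_ONLY.include?(key) }
  [tweet, author]
end

sample = {
  "tweet" => {
    "tweetId" => 1025, "tweetContent" => "Hey this is a fake document",
    "hashtags" => ["stackOverflow", "elasticsearch"],
    "publishedAt" => "2017 23 August",
    "analytics" => { "likeNumber" => 400, "shareNumber" => 100 }
  },
  "author" => {
    "authorId" => 819744, "authorAt" => "the_expert",
    "authorName" => "John Smith", "description" => "fake description"
  }
}

tweet, author = split_event(sample)
p tweet
p author
```

The two printed hashes should carry the same fields as the two events above, minus the metadata Logstash adds by itself (`@timestamp`, `@version`, `host`).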
You can also use a ruby script to flatten the fields, and then use mutate to rename them where necessary.
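As a rough illustration of that flattening idea, here is a minimal plain-Ruby sketch. `flatten_hash` is a hypothetical helper, not part of Logstash, and joining the key names with `_` is an arbitrary choice made for the example.

```ruby
# Hypothetical helper: collapse nested hashes into top-level keys.
# Nested key names are joined with "_" (an assumption, pick any separator).
def flatten_hash(hash, prefix = nil)
  hash.each_with_object({}) do |(key, value), flat|
    name = prefix ? "#{prefix}_#{key}" : key.to_s
    if value.is_a?(Hash)
      # Recurse into nested objects such as "analytics".
      flat.merge!(flatten_hash(value, name))
    else
      # Scalars and arrays (e.g. hashtags) are kept as they are.
      flat[name] = value
    end
  end
end

nested = {
  "tweet" => {
    "tweetId" => 1025,
    "hashtags" => ["stackOverflow", "elasticsearch"],
    "analytics" => { "likeNumber" => 400, "shareNumber" => 100 }
  }
}

p flatten_hash(nested)
```

Inside a ruby filter the same logic would read the nested object with `event.get` and write each flattened key back with `event.set`, the two calls the configuration above already uses; a later mutate with `rename` could then turn a generated name such as `tweet_analytics_shareNumber` into `shareNumber`.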
If you want Elasticsearch to use authorId and tweetId instead of the default ID, you can configure the elasticsearch output with document_id.
output {
stdout { codec => dots }
if [type] == "tweet" {
elasticsearch {
hosts => [ "localhost:9200" ]
index => "twitter"
document_type => "tweet"
document_id => "%{[tweetId]}"
}
} else {
elasticsearch {
hosts => [ "localhost:9200" ]
index => "twitter"
document_type => "author"
document_id => "%{[authorId]}"
}
}
}
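The `document_id` values are field references that get resolved for each event. The sketch below imitates that in plain Ruby to show which ID each event would end up with. `resolve_id` is a hypothetical helper that only handles top-level `%{[field]}` references, not the full Logstash syntax, and it assumes that a reference to a missing field is left in place as literal text.

```ruby
# Hypothetical imitation of how a "%{[field]}" reference is filled in.
# Only top-level references are handled; this is not the Logstash code.
def resolve_id(template, event)
  template.gsub(/%\{\[(\w+)\]\}/) do |reference|
    key = Regexp.last_match(1)
    # Assumption: an unknown field leaves the reference untouched.
    event.key?(key) ? event[key].to_s : reference
  end
end

tweet_event  = { "type" => "tweet",  "tweetId"  => "1025" }
author_event = { "type" => "author", "authorId" => "819744" }

puts resolve_id("%{[tweetId]}", tweet_event)
puts resolve_id("%{[authorId]}", author_event)
puts resolve_id("%{[authorId]}", tweet_event)
```

If the assumption about missing fields holds, an event routed to the wrong branch would be indexed under the literal placeholder as its ID, so it is worth checking that `tweetId` and `authorId` still exist on the events that reach the output.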