How do I preprocess documents before indexing?

Time: 2017-08-23 12:34:04

Tags: elasticsearch twitter logstash elastic-stack

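I am using Logstash and Elasticsearch to collect tweets with the Twitter input plugin. I receive documents from Twitter and would like to preprocess them before they are indexed. Let's say this is the document that comes back from Twitter: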

{
    "tweet": {
       "tweetId": 1025,
       "tweetContent": "Hey this is a fake document for stackoverflow #stackOverflow #elasticsearch",
       "hashtags": ["stackOverflow", "elasticsearch"],
       "publishedAt": "2017 23 August",
       "analytics": {
           "likeNumber": 400,
           "shareNumber": 100
       }
    },
    "author":{
       "authorId": 819744,
       "authorAt": "the_expert",
       "authorName": "John Smith",
       "description": "Haha it's a fake description"
    }
}

From the document Twitter sends me, I want to generate two documents. The first will be indexed at twitter/tweet/1025:

# The id for this document should be the one from tweetId `"tweetId": 1025`
{
    "content": "Hey this is a fake document for stackoverflow #stackOverflow #elasticsearch", # this field has been renamed
    "hashtags": ["stackOverflow", "elasticsearch"],
    "date": "2017/08/23", # the date has been formatted
    "shareNumber": 100 # This field has been flattened
}

The second will be indexed at twitter/author/819744:

# The id for this document should be the one from authorId `"authorId": 819744`
{
   "authorAt": "the_expert",
   "description": "Haha it's a fake description"
}

I have defined the output as follows:

output {
  stdout { codec => dots }
  elasticsearch {
    hosts => [ "localhost:9200" ]
    index => "twitter"
    document_type => "tweet"
  }
}

How can I process the information coming from Twitter?

Edit

So my complete configuration file should look like this:

input {
  twitter {
      consumer_key => "consumer_key"
      consumer_secret => "consumer_secret"
      oauth_token => "access_token"
      oauth_token_secret => "access_token_secret"
      keywords => [ "random", "word"]
      full_tweet => true
      type => "tweet"
  }
}
filter {
  clone {
    clones => ["author"]
  }
  if([type] == "tweet") {
    mutate {
      remove_field => ["authorId", "authorAt"]
    }
  } else {
     mutate {
      remove_field => ["tweetId", "tweetContent"]
     }
  }
}
output {
  stdout { codec => dots }
  if [type] == "tweet" { 
    elasticsearch {
      hosts => [ "localhost:9200" ]
      index => "twitter"
      document_type => "tweet"
      document_id => "%{[tweetId]}"
    }
  } else {
     elasticsearch {
      hosts => [ "localhost:9200" ]
      index => "twitter"
      document_type => "author"
      document_id => "%{[authorId]}"
    }
  }
}

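1 answer:

Answer 0 (score: 2)

You can use the clone filter plugin in Logstash.

Here is a sample Logstash configuration file that takes JSON input from stdin and displays the result on stdout: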

input {
  stdin {
    codec => json
    type => "tweet"
  }
}
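filter {
  # Copy the nested fields to the top level, renaming them on the way.
  # Values built from %{...} references end up as strings.
  mutate {
    add_field => {
      "tweetId" => "%{[tweet][tweetId]}"
      "content" => "%{[tweet][tweetContent]}"
      "date" => "%{[tweet][publishedAt]}"
      "shareNumber" => "%{[tweet][analytics][shareNumber]}"
      "authorId" => "%{[author][authorId]}"
      "authorAt" => "%{[author][authorAt]}"
      "description" => "%{[author][description]}"
    }
  }
  # Parse "2017 23 August" and store the resulting timestamp back in "date".
  date {
    match => ["date", "yyyy dd MMMM"]
    target => "date"
  }
  # Copy the hashtags with ruby so that they stay an array.
  ruby {
    code => '
      event.set("hashtags", event.get("[tweet][hashtags]"))
    '
  }
  # Duplicate the event; the copy carries the type "author" (see the output further down).
  clone {
    clones => ["author"]
  }
  # Drop the original nested objects from both events.
  mutate {
    remove_field => ["author", "tweet", "message"]
  }
  # Keep only the tweet fields on the original and only the author fields on the clone.
  if([type] == "tweet") {
    mutate {
      remove_field => ["authorId", "authorAt", "description"]
    }
  } else {
    mutate {
      remove_field => ["tweetId", "content", "hashtags", "date", "shareNumber"]
    }
  }
}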
output {
  stdout {
    codec => rubydebug
  }
}

With this input:

{"tweet": { "tweetId": 1025, "tweetContent": "Hey this is a fake document", "hashtags": ["stackOverflow", "elasticsearch"], "publishedAt": "2017 23 August","analytics": { "likeNumber": 400, "shareNumber": 100 } }, "author":{ "authorId": 819744, "authorAt": "the_expert", "authorName": "John Smith", "description": "fake description" } }

you get these two documents:

{
           "date" => 2017-08-23T00:00:00.000Z,
       "hashtags" => [
        [0] "stackOverflow",
        [1] "elasticsearch"
    ],
           "type" => "tweet",
        "tweetId" => "1025",
        "content" => "Hey this is a fake document",
    "shareNumber" => "100",
     "@timestamp" => 2017-08-23T20:36:53.795Z,
       "@version" => "1",
           "host" => "my-host"
}
{
    "description" => "fake description",
           "type" => "author",
       "authorId" => "819744",
     "@timestamp" => 2017-08-23T20:36:53.795Z,
       "authorAt" => "the_expert",
       "@version" => "1",
           "host" => "my-host"
}
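Note that the date filter stores a timestamp in "date", while the question asked for the string "2017/08/23". If that exact string is needed, the conversion could be sketched in plain Ruby as below. This is only an illustration: `format_published_at` is a hypothetical helper name, and inside a ruby filter the same logic would read and write the field with `event.get` and `event.set` as in the configuration above.

```ruby
require "date"

# Hypothetical helper: parse a date such as "2017 23 August"
# (year, day of month, full English month name) and render it as yyyy/mm/dd.
def format_published_at(raw)
  Date.strptime(raw, "%Y %d %B").strftime("%Y/%m/%d")
end

puts format_published_at("2017 23 August")
```

Keeping the timestamp is usually the simpler choice if the field should be mapped as a date in Elasticsearch; the string form mainly matters when the exact layout from the question is required.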

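You can also flatten the fields with a ruby script and then rename them with mutate if necessary.

If you want Elasticsearch to use authorId and tweetId as the document IDs instead of the default generated IDs, you can configure the elasticsearch output with document_id.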
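As a rough sketch of the flattening idea, the plain Ruby function below collapses a nested hash into a single level by joining the key names with an underscore. The name `flatten_hash` and the underscore convention are assumptions made for this example, not something Logstash provides; in a ruby filter the resulting pairs would be written back with `event.set`.

```ruby
# Hypothetical helper: recursively flatten a nested hash.
# Nested keys are joined with "_", e.g. analytics.shareNumber -> "analytics_shareNumber".
def flatten_hash(hash, prefix = nil)
  hash.each_with_object({}) do |(key, value), flat|
    name = prefix ? "#{prefix}_#{key}" : key.to_s
    if value.is_a?(Hash)
      # Descend into nested objects, carrying the key path along.
      flat.merge!(flatten_hash(value, name))
    else
      flat[name] = value
    end
  end
end

tweet = {
  "tweetId" => 1025,
  "analytics" => { "likeNumber" => 400, "shareNumber" => 100 }
}

flatten_hash(tweet).each { |name, value| puts "#{name}: #{value}" }
```

A field such as `analytics_shareNumber` could then be renamed to `shareNumber` with mutate. Unlike the `add_field` approach, this keeps numbers as numbers instead of turning them into strings.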

output {
  stdout { codec => dots }
  if [type] == "tweet" { 
    elasticsearch {
      hosts => [ "localhost:9200" ]
      index => "twitter"
      document_type => "tweet"
      document_id => "%{[tweetId]}"
    }
  } else {
     elasticsearch {
      hosts => [ "localhost:9200" ]
      index => "twitter"
      document_type => "author"
      document_id => "%{[authorId]}"
    }
  }
}
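To check the routing without a running cluster, the mapping from an event to its destination can be mimicked in plain Ruby. This is only a sketch of the intent, following the layout requested in the question (twitter/tweet/<tweetId> and twitter/author/<authorId>); `target_for` is a hypothetical helper and does not call Logstash or Elasticsearch.

```ruby
# Hypothetical helper: build the "index/type/id" destination for a
# flattened event, mirroring the conditional in the output section.
def target_for(event, index = "twitter")
  if event["type"] == "tweet"
    id = event["tweetId"]
    "#{index}/tweet/#{id}"
  else
    id = event["authorId"]
    "#{index}/author/#{id}"
  end
end

puts target_for({ "type" => "tweet", "tweetId" => "1025" })
puts target_for({ "type" => "author", "authorId" => "819744" })
```

Because the ID comes from the event itself, indexing the same tweet or author again would overwrite the existing document rather than create a duplicate.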