How to parse a TMX file (an XML file of translation data) in Logstash

Time: 2017-08-25 13:07:23

Tags: elasticsearch logstash logstash-file

I am using a TMX file (an XML file of translation data) as the source in Logstash to index data into Elasticsearch.

A sample TMX file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="ModernMT - modernmt.eu" creationtoolversion="1.0" datatype="plaintext" o-tmf="ModernMT" segtype="sentence" adminlang="en-us" srclang="en-GB"/>
  <body>
    <tu srclang="en-GB" datatype="plaintext" creationdate="20121019T114713Z">
      <tuv xml:lang="en-GB">
        <seg>The purpose of the standard is to establish and define the requirements for the provision of quality services by translation service providers.</seg>
      </tuv>
      <tuv xml:lang="it">
        <seg>L'obiettivo dello standard è stabilire e definire i requisiti affinché i fornitori di servizi di traduzione garantiscano servizi di qualità.</seg>
      </tuv>
    </tu>
    <tu srclang="en-GB" datatype="plaintext" creationdate="20111223T112746Z">
      <tuv xml:lang="en-GB">
        <seg>With 1,800 experienced and qualified resources translating regularly into over 200 language combinations, you can count on us for high quality professional translation services.</seg>
      </tuv>
      <tuv xml:lang="it">
        <seg>Abbiamo 1.800 professionisti esperti e qualificati che traducono regolarmente in oltre 200 combinazioni linguistiche; perciò, se cercate la qualità, potete contare su di noi.</seg>
      </tuv>
    </tu>
    <tu srclang="en-GB" datatype="plaintext" creationdate="20111223T112746Z">
      <tuv xml:lang="en-GB">
        <seg>Access our section of useful links</seg>
      </tuv>
      <tuv xml:lang="it">
        <seg>Da qui potrete accedere a una sezione che propone link a siti che possono essere di vostro interesse</seg>
      </tuv>
    </tu>
  </body>
</tmx>

What I need to do is access each <tu> block as an event, where the two inner <tuv> blocks will be used as data fields. The data stored in the first <tuv> block will be indexed in ES as the source-language data field, and the data stored in the second <tuv> block as the target-language data field.

A TMX document can contain more than 10,000 tuv blocks.

I am having trouble with the xml filter. My configuration currently looks like this:

input {
    file {
        path => "/en-gb_pt-pt/81384/81384.xml"
        start_position => "beginning"
        codec => multiline {
            pattern => "<tu>"
            negate => "true"
            what => "previous"
        }
    }
}

filter {
    xml {
        source => "message"
        target => "xml_content"
        xpath => [ "//seg", "seg" ]
    }
}

output {
    stdout {
        #codec => json
        codec => rubydebug
    }
}

1 Answer:

Answer 0 (score: 2)

I suggest a simple approach using a grok or dissect filter:

filter {
    dissect {
        mapping => { "message" => "%{}<seg>%{src}</seg>%{}<seg>%{trg}</seg>%{}" }
    }
    mutate {
        remove_field => ["message"]
    }
}

You get:

{
          "path" => "/en-gb_pt-pt/81384/81384.xml",
    "@timestamp" => 2017-08-25T15:07:34.567Z,
           "src" => "The purpose of the standard is to establish and define the requirements for the provision of quality services by translation service providers.",
      "@version" => "1",
          "host" => "my_host",
           "trg" => "L'obiettivo dello standard è stabilire e definire i requisiti affinché i fornitori di servizi di traduzione garantiscano servizi di qualità.",
          "tags" => [
        [0] "multiline"
    ]
}
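
For comparison, here is a minimal sketch of how the same two fields might be extracted with the xml filter's xpath option instead of dissect. It rests on several assumptions that are not stated in the question: every opening <tu ...> tag starts on its own line in the file, and each resulting event holds one well-formed <tu>...</tu> block. The pattern "<tu[ >]", the one-second auto_flush_interval, and the conditional guard are choices made for this illustration; the field names src and trg simply mirror the dissect mapping above. Note that the multiline pattern is a regular expression, so the question's "<tu>" would not match an opening tag that carries attributes such as <tu srclang="en-GB" ...>. The first event (the file header) and the last one (which also carries the closing </body> and </tmx> tags) may not parse as XML and would need separate handling.

```
input {
    file {
        path => "/en-gb_pt-pt/81384/81384.xml"
        start_position => "beginning"
        codec => multiline {
            # Regex: matches "<tu " or "<tu>", but not "<tuv"
            pattern => "<tu[ >]"
            negate => true
            what => "previous"
            # Assumed value: emit the last buffered block after 1 s without new lines
            auto_flush_interval => 1
        }
    }
}

filter {
    xml {
        source => "message"
        # Only the xpath results are needed, not the whole parsed tree
        store_xml => false
        xpath => {
            "/tu/tuv[1]/seg/text()" => "src"
            "/tu/tuv[2]/seg/text()" => "trg"
        }
    }
    # xpath results are stored as arrays; keep only the first match,
    # and only when both fields were actually extracted
    if [src] and [trg] {
        mutate {
            replace => {
                "src" => "%{[src][0]}"
                "trg" => "%{[trg][0]}"
            }
            remove_field => ["message"]
        }
    }
}
```

Selecting tuv[1] and tuv[2] by position follows the question's description of the first block as the source language and the second as the target language; if the order of <tuv> elements can vary, selecting on the xml:lang attribute would be the safer variant.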