Extracting XML blocks from a plain-text log file

Time: 2015-02-04 07:35:11

Tags: regex xml linux sed grep

I have a log containing SOAP request/response entries:

[2015-02-03 19:05:13] TIME:03.02.2015 19:05:13,
                   RAW_REQUEST:<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="pay_parent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns2="providers"><SOAP-ENV:Body><!-- ... -->
</SOAP-ENV:Body></SOAP-ENV:Envelope>
,
                   uid:0de7d51a-abb6-11e4-a436-005056936d96,
                   ===

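I want to extract all the XML documents into one big XML file (extract the blocks and wrap them in a root ... tag). I also need the date of each log record.

I would like to end up with something like this (I can add the root xmlns attributes by hand):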

<Records xmlns="" ...>
    <Record datetime="2015-02-03 19:05:13">
        <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="pay_parent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns2="providers"><SOAP-ENV:Body>
            <!-- Other xml data -->
        </SOAP-ENV:Body></SOAP-ENV:Envelope>
    </Record>
    ...
</Records>

2 Answers:

Answer 0 (score: 1)

You can do this with awk. For example:

Create a file named awkscript and add the following code

BEGIN { print "<Records>" }   # add the root xmlns attributes here by hand
# A record starts with a line like "[2015-02-03 19:05:13] ...".
# {n} intervals need a POSIX awk (older gawk versions: --re-interval).
/^\[[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}\]/ {
    if (state) print "\t</Record>"
    # Characters 2-20 of the line are the timestamp between the brackets.
    print "\t<Record datetime=\"" substr($0, 2, 19) "\">"
    state = 1; next
}
# Copy the first run of consecutive lines that start with a SOAP-ENV tag.
state && /^<\/?SOAP-ENV:/ { print "\t\t" $0; state = 2; next }
state == 2 { print "\t</Record>"; state = 0 }
END { if (state) print "\t</Record>"; print "</Records>" }

Run the script against the log file in the shell

awk -f path_to_awkscript path_to_log_file > path_to_new_file

Example

Running the script on a log file that contains the following data

[2015-02-03 19:05:13] TIME:03.02.2015 19:05:13,
                   RAW_REQUEST:<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="pay_parent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns2="providers"><SOAP-ENV:Body><!-- ... -->
</SOAP-ENV:Body></SOAP-ENV:Envelope>
,
                   uid:0de7d51a-abb6-11e4-a436-005056936d96,
                   ===

[2014-11-03 19:05:13] TIME:03.02.2015 19:05:13,
                   RAW_REQUEST:<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="pay_parent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns2="providers"><SOAP-ENV:Body><!-- ... -->
</SOAP-ENV:Body></SOAP-ENV:Envelope>
,
                   uid:0de7d51a-abb6-11e4-a436-005056936d96,
                   ===


[2014-12-15 19:05:13] TIME:03.02.2015 19:05:13,
                   RAW_REQUEST:<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="pay_parent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns2="providers"><SOAP-ENV:Body><!-- ... -->
</SOAP-ENV:Body></SOAP-ENV:Envelope>
,
                   uid:0de7d51a-abb6-11e4-a436-005056936d96,
                   ===

</SOAP-ENV:Body></SOAP-ENV:Envelope>

Result

<Records>
    <Record datetime="2015-02-03 19:05:13">
        <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="pay_parent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns2="providers"><SOAP-ENV:Body><!-- ... -->
        </SOAP-ENV:Body></SOAP-ENV:Envelope>
    </Record>
    <Record datetime="2014-11-03 19:05:13">
        <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="pay_parent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns2="providers"><SOAP-ENV:Body><!-- ... -->
        </SOAP-ENV:Body></SOAP-ENV:Envelope>
    </Record>
    <Record datetime="2014-12-15 19:05:13">
        <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="pay_parent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns2="providers"><SOAP-ENV:Body><!-- ... -->
        </SOAP-ENV:Body></SOAP-ENV:Envelope>
    </Record>
</Records>
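
Once the combined file is well-formed, it can be read back with a standard XML parser to confirm that every record kept its date. The following is only a minimal sketch using Python's xml.etree.ElementTree; SAMPLE is a shortened, hypothetical stand-in for the real output, and the names list_records and SOAP_NS (with its "soap" prefix) are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the combined <Records> file.
SAMPLE = """<Records>
    <Record datetime="2015-02-03 19:05:13">
        <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"><SOAP-ENV:Body><!-- ... -->
        </SOAP-ENV:Body></SOAP-ENV:Envelope>
    </Record>
    <Record datetime="2014-11-03 19:05:13">
        <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"><SOAP-ENV:Body><!-- ... -->
        </SOAP-ENV:Body></SOAP-ENV:Envelope>
    </Record>
</Records>"""

# The prefix "soap" is local to this sketch; only the namespace URI matters.
SOAP_NS = {'soap': 'http://schemas.xmlsoap.org/soap/envelope/'}


def list_records(xml_text):
    """Return (datetime, has_body) for every Record element."""
    # fromstring() raises xml.etree.ElementTree.ParseError on malformed input.
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.findall('Record'):
        body = record.find('soap:Envelope/soap:Body', SOAP_NS)
        rows.append((record.get('datetime'), body is not None))
    return rows


print(list_records(SAMPLE))
# [('2015-02-03 19:05:13', True), ('2014-11-03 19:05:13', True)]
```

This assumes the root element has no default namespace. If an xmlns="..." attribute is added to Records by hand, the 'Record' lookup has to be namespace-qualified as well.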

Answer 1 (score: 0)

I could not find a solution using Linux console tools such as grep or sed, so I wrote a Python script.

import sys
import re


def write_xml_log(out_path, lines):
    u"""
    Joins xml chunks into one document.
    """
    # The with-block closes the file even if the generator raises midway.
    with open(out_path, 'w') as out_fh:
        out_fh.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out_fh.write('<LogRecords>\n')
        out_fh.writelines(
            '<LogRecord>\n{}\n</LogRecord>\n'.format(line) for line in lines)
        out_fh.write('</LogRecords>')


def prepare_xml_chunks(log_path):
    u"""
    Prepares xml-chunks.
    """
    log_fh = open(log_path)

    # Raw strings keep the regex escapes (\[, \d, \w) away from Python's own
    # string-escape handling.
    record_date_re = re.compile(r'^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]')
    envelope_start_re = re.compile(r'(<(?:[\w-]+:)?Envelope)(.*)$')
    envelope_end_re = re.compile(r'(.*</(?:[\w-]+:)?Envelope>)')
    envelope_complete_re = re.compile(
        r'(<(?:[\w-]+:)?Envelope)(.*?>.*?</(?:[\w-]+:)?Envelope>)')

    record_date = ''
    record_envelope = ''
    state_in_envelope = False

    for line in log_fh:
        match_date = record_date_re.match(line)
        match_envelope_start = envelope_start_re.match(line)
        match_envelope_end = envelope_end_re.match(line)
        match_envelope_complete = envelope_complete_re.match(line)

        if match_date:
            record_date = match_date.group(1)

        if not state_in_envelope:
            # One-line envelope
            if match_envelope_complete:
                state_in_envelope = False
                record_envelope = ''

                yield '{} datetime="{}" {}\n'.format(
                    match_envelope_complete.group(1),
                    record_date,
                    match_envelope_complete.group(2))

            # Multi-line envelope start.
            elif match_envelope_start:
                state_in_envelope = True
                record_envelope = '{} datetime="{}" {}\n'.format(
                    match_envelope_start.group(1),
                    record_date,
                    match_envelope_start.group(2))

            # Problem situation.
            elif match_envelope_end:
                raise Exception('Envelope close tag without open tag.')
        else:
            # Multi-line envelope continue.
            if not match_envelope_end:
                record_envelope += line

            # Multi-line envelope end.
            else:
                record_envelope += match_envelope_end.group(1)
                yield '{}\n'.format(record_envelope)

                record_envelope = ''
                state_in_envelope = False

    log_fh.close()


write_xml_log(sys.argv[2], prepare_xml_chunks(sys.argv[1]))
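
The script above takes the log path as its first argument and the output path as its second, and it writes the date as a datetime attribute on the Envelope element itself. If the wrapper layout from the question (one Record element carrying the datetime) is preferred, a shorter variant is possible. This is a rough sketch, not a drop-in replacement: it assumes the whole log fits in memory and that every record starts with a "[YYYY-MM-DD HH:MM:SS]" header at the beginning of a line. The names HEADER_RE, ENVELOPE_RE and wrap_records are invented for this example.

```python
import re

# Assumed layout: each record opens with "[YYYY-MM-DD HH:MM:SS]" at line start.
HEADER_RE = re.compile(
    r'^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]', re.MULTILINE)
# First envelope in a record, with any namespace prefix, possibly multi-line.
ENVELOPE_RE = re.compile(
    r'<(?:[\w-]+:)?Envelope\b.*?</(?:[\w-]+:)?Envelope>', re.DOTALL)


def wrap_records(log_text):
    """Build a <Records> document with one <Record> per envelope found."""
    # With one capturing group, split() returns
    # [text before the first header, date1, body1, date2, body2, ...].
    parts = HEADER_RE.split(log_text)
    lines = ['<Records>']
    for date, body in zip(parts[1::2], parts[2::2]):
        envelope = ENVELOPE_RE.search(body)
        if envelope is None:
            continue  # record without an envelope: skipped in this sketch
        lines.append('    <Record datetime="{}">'.format(date))
        lines.append('        ' + envelope.group(0))
        lines.append('    </Record>')
    lines.append('</Records>')
    return '\n'.join(lines) + '\n'


sample = (
    '[2015-02-03 19:05:13] TIME:03.02.2015 19:05:13,\n'
    '                   RAW_REQUEST:<?xml version="1.0" encoding="UTF-8"?>\n'
    '<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">'
    '<SOAP-ENV:Body><!-- ... -->\n'
    '</SOAP-ENV:Body></SOAP-ENV:Envelope>\n'
    ',\n'
    '                   uid:0de7d51a-abb6-11e4-a436-005056936d96,\n'
    '                   ===\n'
)
print(wrap_records(sample))
```

Splitting on the header first means an envelope can never be paired with the date of a different record, which a single regex spanning header and envelope would not guarantee when a record has no envelope.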