Question

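I'm having a problem parsing Snort logs with the pyparsing module.

The problem is in separating the Snort log (which has multi-line entries separated by a blank line) and getting pyparsing to parse each entry as a whole chunk, rather than reading line by line and expecting the grammar to work on each line (obviously, it does not).

I have tried converting each chunk to a temporary string and stripping the newlines inside each chunk, but it refuses to process correctly. I may be completely on the wrong track, but I don't think so (a similar form works perfectly for syslog-type logs, but those are one-line entries and so lend themselves to a basic file iterator with line-by-line processing).

Here is a sample of my log and the code I have so far: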

[**] [1:486:4] ICMP Destination Unreachable Communication with Destination Host is Administratively Prohibited [**]
[Classification: Misc activity] [Priority: 3] 
08/03-07:30:02.233350 172.143.241.86 -> 63.44.2.33
ICMP TTL:61 TOS:0xC0 ID:49461 IpLen:20 DgmLen:88
Type:3  Code:10  DESTINATION UNREACHABLE: ADMINISTRATIVELY PROHIBITED HOST FILTERED
** ORIGINAL DATAGRAM DUMP:
63.44.2.33:41235 -> 172.143.241.86:4949
TCP TTL:61 TOS:0x0 ID:36212 IpLen:20 DgmLen:60 DF
Seq: 0xF74E606
(32 more bytes of original packet)
** END OF DUMP

[**] ...more like this [**]

Updated code:

def snort_parse(logfile):
    header = Suppress("[**] [") + Combine(integer + ":" + integer + ":" + integer) + Suppress("]") + Regex(".*") + Suppress("[**]")
    cls = Optional(Suppress("[Classification:") + Regex(".*") + Suppress("]"))
    pri = Suppress("[Priority:") + integer + Suppress("]")
    date = integer + "/" + integer + "-" + integer + ":" + integer + "." + Suppress(integer)
    src_ip = ip_addr + Suppress("->")
    dest_ip = ip_addr
    extra = Regex(".*")

    bnf = header + cls + pri + date + src_ip + dest_ip + extra

    def logreader(logfile):
        chunk = []
        with open(logfile) as snort_logfile:
            for line in snort_logfile:
                if line !='\n':
                    line = line[:-1]
                    chunk.append(line)
                    continue
                else:
                    print chunk
                    yield " ".join(chunk)
                    chunk = []

    string_to_parse = "".join(logreader(logfile).next())
    fields = bnf.parseString(string_to_parse)
    print fields

Any help, pointers, RTFMs, "you're doing it wrong"s, etc. are greatly appreciated.

Answer 1

import pyparsing as pyp
import itertools

integer = pyp.Word(pyp.nums)
ip_addr = pyp.Combine(integer+'.'+integer+'.'+integer+'.'+integer)

def snort_parse(logfile):
    header = (pyp.Suppress("[**] [")
              + pyp.Combine(integer + ":" + integer + ":" + integer)
              + pyp.Suppress(pyp.SkipTo("[**]", include = True)))
    cls = (
        pyp.Suppress(pyp.Optional(pyp.Literal("[Classification:")))
        + pyp.Regex("[^]]*") + pyp.Suppress(']'))

    pri = pyp.Suppress("[Priority:") + integer + pyp.Suppress("]")
    date = pyp.Combine(
        integer+"/"+integer+'-'+integer+':'+integer+':'+integer+'.'+integer)
    src_ip = ip_addr + pyp.Suppress("->")
    dest_ip = ip_addr

    bnf = header+cls+pri+date+src_ip+dest_ip

    with open(logfile) as snort_logfile:
        for has_content, grp in itertools.groupby(
                snort_logfile, key = lambda x: bool(x.strip())):
            if has_content:
                tmpStr = ''.join(grp)
                fields = bnf.searchString(tmpStr)
                print(fields)

snort_parse('snort_file')

yields

[['1:486:4', 'Misc activity', '3', '08/03-07:30:02.233350', '172.143.241.86', '63.44.2.33']]
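
The part of the code above that turns the file into whole entries is the itertools.groupby call: it groups consecutive lines by whether they are blank, so each run of non-blank lines becomes one entry. A minimal sketch of just that step on in-memory lines (the helper name split_entries and the shortened sample lines are made up for illustration):

```python
import itertools

def split_entries(lines):
    """Yield each blank-line-separated entry as a single string."""
    # groupby starts a new group whenever the key flips between
    # "line has content" (True) and "line is blank" (False).
    for has_content, group in itertools.groupby(
            lines, key=lambda line: bool(line.strip())):
        if has_content:
            yield "".join(group)

# Shortened, made-up sample lines standing in for a real Snort log.
sample = [
    "[**] [1:486:4] first alert [**]\n",
    "[Classification: Misc activity] [Priority: 3]\n",
    "\n",
    "[**] [1:486:4] second alert [**]\n",
    "[Classification: Misc activity] [Priority: 3]\n",
]

entries = list(split_entries(sample))
for entry in entries:
    print(repr(entry))
```

The newlines left inside each entry are harmless here, because pyparsing skips whitespace (including newlines) between tokens by default.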

Answer 2

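You have some regex unlearning to do, but hopefully this won't be too painful. The biggest culprit in your thinking is the use of this construct: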

some_stuff + Regex(".*") + \
                 Suppress(string_representing_where_you_want_the_regex_to_stop)

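Each subparser within a pyparsing parser is pretty much standalone, and works sequentially through the incoming text. So the Regex term has no way to look ahead to the next expression to see where the '*' repetition should stop. In other words, the expression Regex(".*") is going to just read until the end of the line, since that is where ".*" stops without specifying multiline.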
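
This behaviour can be checked directly. The sketch below (the shortened sample line is made up) shows Regex(".*") swallowing the whole line, terminator included, so that the expression after it has nothing left to match:

```python
import pyparsing as pp

# Shortened, made-up alert text ending in the "[**]" terminator.
line = "ICMP Destination Unreachable [**]"

# On its own, Regex(".*") consumes everything up to the end of the line.
swallowed = pp.Regex(".*").parseString(line)[0]
print(repr(swallowed))

# The Suppress("[**]") that follows therefore starts at end-of-string and fails.
greedy = pp.Regex(".*") + pp.Suppress("[**]")
try:
    greedy.parseString(line)
    greedy_matched = True
except pp.ParseException:
    greedy_matched = False
print(greedy_matched)
```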

In pyparsing, this concept is implemented using SkipTo. Here is how your header line is written:

header = Suppress("[**] [") + Combine(integer + ":" + integer + ":" + integer) + \
             Suppress("]") + Regex(".*") + Suppress("[**]")

Your ".*" problem is resolved by changing it to:

header = Suppress("[**] [") + Combine(integer + ":" + integer + ":" + integer) + \
             Suppress("]") + SkipTo("[**]") + Suppress("[**]")

The same goes for cls, which also uses Regex(".*") followed by a Suppress.
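
As a quick check, here is a sketch that runs the SkipTo version of header against the first line of the sample log. The .strip() is there because the skipped text may carry surrounding whitespace:

```python
import pyparsing as pp

integer = pp.Word(pp.nums)

header = (pp.Suppress("[**] [")
          + pp.Combine(integer + ":" + integer + ":" + integer)
          + pp.Suppress("]")
          + pp.SkipTo("[**]")      # read up to, but not including, "[**]"
          + pp.Suppress("[**]"))

line = ("[**] [1:486:4] ICMP Destination Unreachable Communication with "
        "Destination Host is Administratively Prohibited [**]")

result = header.parseString(line)
sid = result[0]
message = result[1].strip()
print(sid)
print(message)
```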

One last bug: your definition of date is short by one ':' + integer:

date = integer + "/" + integer + "-" + integer + ":" + integer + "." + \
          Suppress(integer)

should be:

date = integer + "/" + integer + "-" + integer + ":" + integer + ":" + \
          integer + "." + Suppress(integer)
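
To see the difference, the sketch below runs both definitions against the timestamp from the sample log (the names short_date and full_date are only for this illustration). The short version expects "." right after the minutes and fails when it finds ":" instead:

```python
import pyparsing as pp

integer = pp.Word(pp.nums)
timestamp = "08/03-07:30:02.233350"

# Original definition: month/day-hour:minute.fraction (seconds missing).
short_date = (integer + "/" + integer + "-" + integer + ":" + integer
              + "." + pp.Suppress(integer))

# Corrected definition: month/day-hour:minute:second.fraction
full_date = (integer + "/" + integer + "-" + integer + ":" + integer
             + ":" + integer + "." + pp.Suppress(integer))

try:
    short_date.parseString(timestamp)
    short_ok = True
except pp.ParseException:
    short_ok = False

tokens = full_date.parseString(timestamp).asList()
print(short_ok)
print(tokens)
```

Wrapping the expression in Combine, as Answer 1 does, would return the timestamp as a single string instead of separate tokens.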

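I think those changes will be sufficient to start parsing your log data.

Here are some other style suggestions:

You have a lot of repeated Suppress("]") expressions. I've started defining all my suppressible punctuation in a very compact and easy-to-maintain statement like this:

LBRACK,RBRACK,LBRACE,RBRACE = map(Suppress,"[]{}")

(expand to add whatever other punctuation characters you like). Now I can use these characters by their symbolic names, and I find the resulting code a little easier to read.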
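
A small sketch of how the symbolic names read in practice, applied to the priority field. Splitting "[Priority:" into LBRACK plus a separate keyword is a variation chosen for this illustration, not something the original code does:

```python
import pyparsing as pp

# map() calls Suppress once per character of the string "[]{}".
LBRACK, RBRACK, LBRACE, RBRACE = map(pp.Suppress, "[]{}")

integer = pp.Word(pp.nums)

# "[Priority: 3]" -> only the number survives; brackets and keyword are suppressed.
pri = LBRACK + pp.Suppress("Priority:") + integer + RBRACK

priority = pri.parseString("[Priority: 3]")[0]
print(priority)
```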

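You start off header with header = Suppress("[**] [") + ... . I never like seeing spaces embedded in literals this way, as it bypasses some of the parsing robustness that pyparsing gives you with its automatic whitespace skipping. If for some reason the space between "[**]" and "[" were changed to 2 or 3 spaces, or a tab, then your suppressed literal would fail. Combine this with the previous suggestion, and header would begin with:

header = Suppress("[**]") + LBRACK + ...

I know this is generated text, so variation in this format is unlikely, but it plays better to pyparsing's strengths.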
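
The difference is easy to demonstrate. In this sketch (the names rigid, flexible and matches are hypothetical helpers for the comparison), the literal with an embedded space only accepts exactly one space, while the split version accepts any run of whitespace:

```python
import pyparsing as pp

LBRACK = pp.Suppress("[")
integer = pp.Word(pp.nums)
sid = pp.Combine(integer + ":" + integer + ":" + integer)

rigid = pp.Suppress("[**] [") + sid              # space baked into the literal
flexible = pp.Suppress("[**]") + LBRACK + sid    # whitespace skipped automatically

def matches(parser, text):
    """Return True if parser accepts the start of text."""
    try:
        parser.parseString(text)
        return True
    except pp.ParseException:
        return False

one_space = "[**] [1:486:4]"
two_spaces = "[**]  [1:486:4]"

for text in (one_space, two_spaces):
    print(repr(text), matches(rigid, text), matches(flexible, text))
```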

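Once you have your fields parsed out, start assigning results names to the different elements within your parser. This will make it a lot easier to get the data out afterward. For instance, changing cls to:

cls = Optional(Suppress("[Classification:") + \
             SkipTo(RBRACK)("classification") + RBRACK)

will allow you to access the classification data using fields.classification.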
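
A sketch of named results on the second line of the sample log. The "priority" results name is an addition for this illustration; the answer above only names "classification":

```python
import pyparsing as pp

RBRACK = pp.Suppress("]")
integer = pp.Word(pp.nums)

cls = pp.Optional(pp.Suppress("[Classification:")
                  + pp.SkipTo(RBRACK)("classification") + RBRACK)
# "priority" is a hypothetical extra results name, added for illustration.
pri = pp.Suppress("[Priority:") + integer("priority") + RBRACK

fields = (cls + pri).parseString("[Classification: Misc activity] [Priority: 3]")
classification = fields.classification.strip()
priority = fields.priority
print(classification, priority)

# Because cls is Optional, an entry without a classification still parses.
no_cls = (cls + pri).parseString("[Priority: 3]")
print(no_cls.priority)
```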

Answer 3

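Well, I don't know Snort or pyparsing, so apologies in advance if I say something stupid. I'm unclear as to whether the problem is with pyparsing being unable to handle the entries, or with you being unable to send them to pyparsing in the right format. If the latter, why not do something like this?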

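def logreader(path_to_file):
    chunk = []
    with open(path_to_file) as theFile:
        for line in theFile:
            if line.strip():
                # Non-blank line: still inside the current entry.
                chunk.append(line)
            elif chunk:
                # Blank line: the current entry is complete.
                yield "".join(chunk)
                chunk = []
    # The last entry may not be followed by a blank line.
    if chunk:
        yield "".join(chunk)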

Of course, if you need to modify each chunk before sending it to pyparsing, you can do so before yielding it.

Parsing Snort logs with PyParsing
