Python re named group - a greedy match

Time: 2014-05-12 05:52:11

Tags: python regex

For some reason, I need to use Python's re module to extract fields from an XML doc.

Here is an example string that I will apply the regex to:

payload2 = '<CheckEventRequest><EventList count="1"><Event event="0x80" path="\\c2_emcvnx.ntaplion.prv\\CHECK$\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db" flag="0x2" protocol="0" server="C2_EMCVNX" share="demoshare1" clientIP="172.26.64.233" serverIP="172.26.64.225" timeStamp="0x536C47EF0003C836" userSid="S-1-5-21-665520413-3518186362-2792099713-500" ownerSid="S-1-5-21-665520413-3518186362-2792099713-500" fileSize="0x11000" desiredAccess="0x0" createDispo="0x0" ntStatus="0x0" relativePath="\\C2_EMCVNX\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db"/></EventList></CheckEventRequest>'

Some of the fields you see above, such as 'clientIP', may not always be present.

The regex I came up with is:

import re

PAT3 = re.compile(r'.+(event="(?P<event_code>\S*?)"){1}[\S\s]+?(path="(?P<path>[\s\S]+?)"){0,1}[\S\s]+(clientIP="(?P<client_ip>[\S\s]+?)"){0,1}.*', re.UNICODE)

m1 = PAT3.search(payload2)
print(m1.groupdict())

Output:

{'path': '\\c2_emcvnx.ntaplion.prv\\CHECK$\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db', 'client_ip': None, 'event_code': '0x80'}

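However, when I put {1} instead of {0,1} after (clientIP="(?P<client_ip>[\S\s]+?)"), the client_ip group is captured. But this makes the match fail when clientIP is not present.

Any ideas on how to make the regex work whether or not the field is present?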

2 Answers:

Answer 0: (score: 0)

First, I have to give you the standard warning against parsing XML with regular expressions, but if you're dead set on it...
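For comparison, here is a minimal sketch of what that warning points to: reading the attributes with a real XML parser. It assumes the payload is well-formed XML and uses xml.etree.ElementTree from the standard library; the function name event_fields and the shortened SAMPLE string are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical payload without a clientIP attribute.
SAMPLE = (
    r'<CheckEventRequest><EventList count="1">'
    r'<Event event="0x80" path="\\srv\share\Thumbs.db" serverIP="172.26.64.225"/>'
    r'</EventList></CheckEventRequest>'
)


def event_fields(payload):
    """Return one dict per <Event> element; absent attributes come back as None."""
    root = ET.fromstring(payload)
    fields = []
    for event in root.iter('Event'):
        fields.append({
            'event_code': event.get('event'),
            'path': event.get('path'),
            'client_ip': event.get('clientIP'),  # None when the attribute is missing
        })
    return fields


print(event_fields(SAMPLE))
```

Since Element.get returns None for a missing attribute, the optional-field problem does not come up at all with this approach.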

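You probably don't want to use [\S\s], because it matches anything, including characters past the closing quote. To prevent that you made it non-greedy, but there is a better solution: just make it not match quotes, with [^"]. Also note that you can replace {0,1} with ?.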
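Putting those two changes together, a sketch of a single pattern might look like the one below. One assumption goes beyond the advice above: the filler between attributes is moved inside each optional group and restricted to [^>], so it can no longer swallow clientIP before the optional group is tried. The sketch also assumes the attributes appear in the order event, path, clientIP. The function name extract and the shortened sample strings are hypothetical.

```python
import re

# Each optional group carries its own lazy filler ([^>]*?), so the group is
# tried as a whole and simply skipped when the attribute is absent.
PAT = re.compile(
    r'\bevent="(?P<event_code>[^"]*)"'
    r'(?:[^>]*?\bpath="(?P<path>[^"]*)")?'
    r'(?:[^>]*?\bclientIP="(?P<client_ip>[^"]*)")?'
)


def extract(payload):
    """Return the named groups as a dict, or None if there is no event attribute."""
    m = PAT.search(payload)
    return m.groupdict() if m else None


# Shortened, hypothetical payloads: one with clientIP, one without.
with_ip = '<Event event="0x80" path="a.db" flag="0x2" clientIP="172.26.64.233" serverIP="172.26.64.225"/>'
without_ip = '<Event event="0x80" path="a.db" flag="0x2" serverIP="172.26.64.225"/>'

print(extract(with_ip)['client_ip'])     # 172.26.64.233
print(extract(without_ip)['client_ip'])  # None
```

The difference from the original pattern is that nothing greedy sits in front of the optional clientIP group, so the regex engine has no way to satisfy the pattern by consuming the attribute as filler.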

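Answer 1: (score: 0)

My suggestion:

Stop trying to do it all in one big single-line regex.

Breaking the code up is very simple, and it makes it not only more readable but also easier to work with.

My version of the code: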

payloads = [
    '<CheckEventRequest><EventList count="1"><Event event="0x80" path="\\c2_emcvnx.ntaplion.prv\\CHECK$\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db" flag="0x2" protocol="0" server="C2_EMCVNX" share="demoshare1" serverIP="172.26.64.225" timeStamp="0x536C47EF0003C836" userSid="S-1-5-21-665520413-3518186362-2792099713-500" ownerSid="S-1-5-21-665520413-3518186362-2792099713-500" fileSize="0x11000" desiredAccess="0x0" createDispo="0x0" ntStatus="0x0" relativePath="\\C2_EMCVNX\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db"/></EventList></CheckEventRequest>',
    '<CheckEventRequest><EventList count="1"><Event event="0x80" path="\\c2_emcvnx.ntaplion.prv\\CHECK$\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db" flag="0x2" protocol="0" server="C2_EMCVNX" share="demoshare1" clientIP="172.26.64.233" serverIP="172.26.64.225" timeStamp="0x536C47EF0003C836" userSid="S-1-5-21-665520413-3518186362-2792099713-500" ownerSid="S-1-5-21-665520413-3518186362-2792099713-500" fileSize="0x11000" desiredAccess="0x0" createDispo="0x0" ntStatus="0x0" relativePath="\\C2_EMCVNX\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db"/></EventList></CheckEventRequest>'
]


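import re


def scrape_xml(payload):
    # Match a dotted IPv4 address inside the clientIP attribute
    ipv4 = r'clientIP="(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"'

    # One small pattern per attribute instead of a single big regex
    pat3 = dict()
    pat3['event_code'] = r'event="(0[xX][0-9a-fA-F]+?)"'
    pat3['path'] = r'path="(.*?)"'
    pat3['client_ip'] = ipv4

    # Search for each attribute independently; a missing one yields None.
    # Dict iteration order is not guaranteed before Python 3.7, so the
    # order of the printed lines may vary.
    matches = {}
    for index, regex in enumerate(pat3):
        matches[index] = re.search(
            pattern=pat3[regex],
            string=payload,
            flags=re.UNICODE
        )

    for index in matches:
        if not index:
            # Blank separator before the first result of each payload
            print("\n")
        if matches[index] is not None:
            print(matches[index].group(0))

for p in payloads:
    scrape_xml(p)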

Output:

path="\c2_emcvnx.ntaplion.prv\CHECK$\demoshare1\Engineering\Benchmarking\Thumbs.db"
event="0x80"

path="\c2_emcvnx.ntaplion.prv\CHECK$\demoshare1\Engineering\Benchmarking\Thumbs.db"
clientIP="172.26.64.233"
event="0x80"
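
A possible variation on the code above, as a sketch: compile one small pattern per field and return the captured values in a dict instead of printing the whole match. The names FIELD_PATTERNS and extract_fields are made up here, and the clientIP pattern is loosened to [^"]* (any quoted value) rather than validating the address.

```python
import re

# One compiled pattern per field; group 1 holds the attribute value.
FIELD_PATTERNS = {
    'event_code': re.compile(r'\bevent="(0[xX][0-9a-fA-F]+)"'),
    'path': re.compile(r'\bpath="([^"]*)"'),
    'client_ip': re.compile(r'\bclientIP="([^"]*)"'),
}


def extract_fields(payload):
    """Search for every field independently; absent fields map to None."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(payload)
        result[name] = match.group(1) if match else None
    return result


# Shortened, hypothetical payload without a clientIP attribute.
sample = '<Event event="0x80" path="a.db" flag="0x2" serverIP="172.26.64.225"/>'
print(extract_fields(sample))
```

Because every field is searched on its own, a missing clientIP just yields None, and the order of the attributes in the payload does not matter.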