For some reason, I need to use Python's re module to extract fields from an XML document.
Here is an example of the string I will apply the regex to:
payload2 = '<CheckEventRequest><EventList count="1"><Event event="0x80" path="\\c2_emcvnx.ntaplion.prv\\CHECK$\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db" flag="0x2" protocol="0" server="C2_EMCVNX" share="demoshare1" clientIP="172.26.64.233" serverIP="172.26.64.225" timeStamp="0x536C47EF0003C836" userSid="S-1-5-21-665520413-3518186362-2792099713-500" ownerSid="S-1-5-21-665520413-3518186362-2792099713-500" fileSize="0x11000" desiredAccess="0x0" createDispo="0x0" ntStatus="0x0" relativePath="\\C2_EMCVNX\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db"/></EventList></CheckEventRequest>'
Some of the fields you see above, such as 'clientIP', may not always be present.
The regex I came up with is:
import re
PAT3 = re.compile(r'.+(event="(?P<event_code>\S*?)"){1}[\S\s]+?(path="(?P<path>[\s\S]+?)"){0,1}[\S\s]+(clientIP="(?P<client_ip>[\S\s]+?)"){0,1}.*', re.UNICODE)
m1 = PAT3.search(payload2)
print(m1.groupdict())
Output:
{'path': '\\c2_emcvnx.ntaplion.prv\\CHECK$\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db', 'client_ip': None, 'event_code': '0x80'}
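As you can see, client_ip comes back as None even though clientIP is present. When I put {1} in place of {0,1} after (clientIP="(?P<client_ip>[\S\s]+?)"), the client IP is captured. However, that makes the match fail when clientIP is not present.
Any ideas on how to make the regex work whether or not the field is present?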
Answer 0 (score: 0)
First, I have to give you the standard warning against parsing XML with regular expressions, but if you're dead set on it...
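For reference, the parser route that the warning points to could look roughly like this. It is only a sketch using the standard library's xml.etree.ElementTree on a shortened, made-up payload; extract_fields is a hypothetical helper name, not something from the question. Element.get returns None for an attribute that is absent, which is the behaviour the question is after.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical payload for illustration; clientIP is deliberately absent.
payload = (
    r'<CheckEventRequest><EventList count="1">'
    r'<Event event="0x80" path="\\host\share\Thumbs.db" serverIP="172.26.64.225"/>'
    r'</EventList></CheckEventRequest>'
)

def extract_fields(xml_text):
    """Return one dict per <Event> element; absent attributes map to None."""
    root = ET.fromstring(xml_text)
    return [
        {
            'event_code': event.get('event'),
            'path': event.get('path'),
            'client_ip': event.get('clientIP'),  # None when the attribute is missing
        }
        for event in root.iter('Event')
    ]

print(extract_fields(payload))
```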
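You probably don't want to use [\S\s], because it matches anything, including text past the closing quote. You made it non-greedy to prevent that, but there is a better solution: simply make it not match quotes, with [^"]. Also note that you can replace {0,1} with ?.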
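Putting those two suggestions together, a single pattern might look something like the sketch below. It assumes the attributes always appear in the order event, path, clientIP inside one tag, and the payloads are shortened, made-up samples; EVENT_RE and extract are names chosen here for illustration. Each optional field sits in its own (?:...)? group that lazily skips ahead to the attribute, so a missing attribute just leaves its group as None.

```python
import re

# Assumption: attributes appear in the order event, path, clientIP within one tag.
EVENT_RE = re.compile(
    r'\bevent="(?P<event_code>[^"]*)"'                # required
    r'(?:[^>]*?\bpath="(?P<path>[^"]*)")?'            # optional
    r'(?:[^>]*?\bclientIP="(?P<client_ip>[^"]*)")?'   # optional
)

# Shortened, hypothetical payloads for illustration.
with_ip = r'<Event event="0x80" path="\\host\share\Thumbs.db" flag="0x2" clientIP="172.26.64.233" serverIP="172.26.64.225"/>'
without_ip = r'<Event event="0x80" path="\\host\share\Thumbs.db" flag="0x2" serverIP="172.26.64.225"/>'

def extract(payload):
    """Return the named groups as a dict, or None if even event= is missing."""
    m = EVENT_RE.search(payload)
    return m.groupdict() if m else None

print(extract(with_ip))
print(extract(without_ip))
```

The [^>]*? skip keeps the search inside the current tag, and [^"]* keeps each value inside its own quotes, so no backtracking across attributes is needed to stop at the right place.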
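Answer 1 (score: 0)
Stop trying to do it all in one big single-line regex.
Breaking the code up is trivial, and the result is not only more readable but also easier to work with.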
payloads = [
'<CheckEventRequest><EventList count="1"><Event event="0x80" path="\\c2_emcvnx.ntaplion.prv\\CHECK$\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db" flag="0x2" protocol="0" server="C2_EMCVNX" share="demoshare1" serverIP="172.26.64.225" timeStamp="0x536C47EF0003C836" userSid="S-1-5-21-665520413-3518186362-2792099713-500" ownerSid="S-1-5-21-665520413-3518186362-2792099713-500" fileSize="0x11000" desiredAccess="0x0" createDispo="0x0" ntStatus="0x0" relativePath="\\C2_EMCVNX\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db"/></EventList></CheckEventRequest>',
'<CheckEventRequest><EventList count="1"><Event event="0x80" path="\\c2_emcvnx.ntaplion.prv\\CHECK$\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db" flag="0x2" protocol="0" server="C2_EMCVNX" share="demoshare1" clientIP="172.26.64.233" serverIP="172.26.64.225" timeStamp="0x536C47EF0003C836" userSid="S-1-5-21-665520413-3518186362-2792099713-500" ownerSid="S-1-5-21-665520413-3518186362-2792099713-500" fileSize="0x11000" desiredAccess="0x0" createDispo="0x0" ntStatus="0x0" relativePath="\\C2_EMCVNX\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db"/></EventList></CheckEventRequest>'
]
import re

def scrape_xml(payload):
    ipv4 = r'clientIP="(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"'
    # One small pattern per field, listed in the order the results are printed.
    pat3 = [
        ('path', r'path="(.*?)"'),
        ('client_ip', ipv4),
        ('event_code', r'event="(0[xX][0-9a-fA-F]+?)"'),
    ]
    print("")  # blank line between payloads
    for name, regex in pat3:
        match = re.search(regex, payload, flags=re.UNICODE)
        # A missing field simply produces no match, so skip it.
        if match is not None:
            print(match.group(0))

for p in payloads:
    scrape_xml(p)
Output:
path="\c2_emcvnx.ntaplion.prv\CHECK$\demoshare1\Engineering\Benchmarking\Thumbs.db"
event="0x80"

path="\c2_emcvnx.ntaplion.prv\CHECK$\demoshare1\Engineering\Benchmarking\Thumbs.db"
clientIP="172.26.64.233"
event="0x80"
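If you want the values themselves rather than the whole attr="value" text, the same one-pattern-per-field idea can return a dict instead of printing. This is just a sketch under my own naming: FIELD_PATTERNS and scrape_fields are hypothetical, the payload is a shortened made-up sample, and the patterns use [^"]* with a single capture group for the value.

```python
import re

# Hypothetical field table: output name -> pattern with one capture group for the value.
FIELD_PATTERNS = {
    'event_code': re.compile(r'\bevent="([^"]*)"'),
    'path': re.compile(r'\bpath="([^"]*)"'),
    'client_ip': re.compile(r'\bclientIP="([^"]*)"'),
}

def scrape_fields(payload):
    """Search for each field independently; absent fields map to None."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(payload)
        result[name] = match.group(1) if match else None
    return result

# Shortened, hypothetical payload without clientIP.
sample = r'<Event event="0x80" path="\\host\share\Thumbs.db" serverIP="172.26.64.225"/>'
print(scrape_fields(sample))
```

Because every field is searched on its own, attribute order does not matter and adding another field is one more entry in the table.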