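I am trying to pull the text between two tags, as in <example>text</example>.
I found posts showing how to do this with a regular expression, but when I tried it in Python I had to escape characters.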
original regex : run = re.findall("(?<=(<runs>))(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| )+?(?=(</runs>))", text)
Full code:
from bs4 import BeautifulSoup

# text comes from a text file, but there is too much data to post all of it here
text = """<os>Windows Vista or Windows 7</os><filename>AS_ENGINE.EXE</filename><header_size>240</header_size><atime>2019-04-28T13:34:34Z</atime><runs>1</runs><filenames><file>
<os>Windows Vista or Windows 7</os><filename>CHRMSTP.EXE</filename><header_size>240</header_size><atime>2019-04-28T13:15:32Z</atime><runs>2</runs><filenames>
<os>Windows Vista or Windows 7</os><filename>RUNDLL32.EXE</filename><header_size>240</header_size><atime>2019-04-28T13:07:35Z</atime><runs>1</runs><filenames><file>"""
soup = BeautifulSoup(text, "lxml")
for x in soup.find_all("runs"):
    print("Original ", x)
for x in soup.find_all("dir"):
    print("Original ", x)
for x in soup.find_all("filename"):
    print("Original ", x)
Then I want to write certain tags to a CSV file:
import csv

fieldnames = 'File Name', 'Number of runs', 'File Path'
with open("C:\\ProgramData\\processed\\winprefetch.csv", 'w', newline='', encoding="utf8") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(fieldnames)
    # Pseudocode for the intended rows: diskimage_name, row, filename,
    # numberofruns and file are not defined anywhere yet
    writer.writerows([[diskimage_name * row], filename, numberofruns, file])
Answer 0 (score: 3)
Parsing XML with regex is a poor approach. Beautiful Soup, a third-party Python library for parsing HTML and XML, handles exactly this task:
from bs4 import BeautifulSoup
text = '<filename>MPSIGSTUB.EXE</filename><header_size>240</header_size><atime>2019-04-28T13:34:33Z</atime><runs>1</runs><filenames><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CNTDLL.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CKERNEL32.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CAPISETSCHEMA.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CKERNELBASE.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CLOCALE.NLS</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSOFTWAREDISTRIBUTION\x5CDOWNLOAD\x5CINSTALL\x5CMPSIGSTUB.EXE</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CADVAPI32.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CMSVCRT.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CSECHOST.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CRPCRT4.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CVERSION.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CCRYPTBASE.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CTEMP\x5CMPSIGSTUB.LOG</file></filenames><volume><path>\x5CDEVICE\x5CHARDDISKVOLUME1</path><creation>2019-04-28T22:00:18Z</creation><serial_number>84c53be0</serial_number><dirnames><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5C$EXTEND</dir><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS</dir><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSOFTWAREDISTRIBUTION</dir><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSOFTWAREDISTRIBUTION\x5CDOWNLOAD</dir><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSOFTWAREDISTRIBUTION\x5CDOWNLOAD\x5CINSTALL</dir><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32</dir><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CTEMP</dir></dirnames></volume>'
soup = BeautifulSoup(text, "lxml")
print(soup.find("runs").text)
for x in soup.find_all("dir"):
    print(x)  # or x.text if you're only interested in the element contents
Output:
1
<dir>\DEVICE\HARDDISKVOLUME1\$EXTEND</dir>
<dir>\DEVICE\HARDDISKVOLUME1\WINDOWS</dir>
<dir>\DEVICE\HARDDISKVOLUME1\WINDOWS\SOFTWAREDISTRIBUTION</dir>
<dir>\DEVICE\HARDDISKVOLUME1\WINDOWS\SOFTWAREDISTRIBUTION\DOWNLOAD</dir>
<dir>\DEVICE\HARDDISKVOLUME1\WINDOWS\SOFTWAREDISTRIBUTION\DOWNLOAD\INSTALL</dir>
<dir>\DEVICE\HARDDISKVOLUME1\WINDOWS\SYSTEM32</dir>
<dir>\DEVICE\HARDDISKVOLUME1\WINDOWS\TEMP</dir>
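For the CSV step the question asks about, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of Beautiful Soup, so it runs without extra packages. It rests on assumptions that are not in the original post: each record is wrapped in a hypothetical <entry> element, the fragment is well-formed XML, and one CSV row per <file> is the layout you want. The helper names rows_from_xml and write_csv are made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: two records, each wrapped in an assumed <entry> element.
SAMPLE = (
    r"<entry><filename>AS_ENGINE.EXE</filename><runs>1</runs>"
    r"<filenames><file>\DEVICE\HARDDISKVOLUME1\WINDOWS\SYSTEM32\NTDLL.DLL</file></filenames></entry>"
    r"<entry><filename>CHRMSTP.EXE</filename><runs>2</runs>"
    r"<filenames><file>\DEVICE\HARDDISKVOLUME1\WINDOWS\SYSTEM32\KERNEL32.DLL</file></filenames></entry>"
)


def rows_from_xml(fragment):
    """Yield [filename, runs, file path] for every <file> in every <entry>."""
    # ElementTree needs a single root element, so wrap the fragment in one.
    root = ET.fromstring("<root>" + fragment + "</root>")
    for entry in root.iter("entry"):
        name = entry.findtext("filename", default="")
        runs = entry.findtext("runs", default="")
        for file_el in entry.iter("file"):
            yield [name, runs, file_el.text or ""]


def write_csv(fragment, out):
    """Write a header row plus one row per <file> to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["File Name", "Number of runs", "File Path"])
    writer.writerows(rows_from_xml(fragment))


# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
write_csv(SAMPLE, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open(path, "w", newline="", encoding="utf8") in place of the buffer.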
Answer 1 (score: 0)
Try this:
import re
text ="<filename>MPSIGSTUB.EXE</filename><header_size>240</header_size><atime>2019-04-28T13:34:33Z</atime><runs>1</runs><filenames><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CNTDLL.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CKERNEL32.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CAPISETSCHEMA.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CKERNELBASE.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CLOCALE.NLS</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSOFTWAREDISTRIBUTION\x5CDOWNLOAD\x5CINSTALL\x5CMPSIGSTUB.EXE</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CADVAPI32.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CMSVCRT.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CSECHOST.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CRPCRT4.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CVERSION.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CCRYPTBASE.DLL</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CTEMP\x5CMPSIGSTUB.LOG</file></filenames><volume><path>\x5CDEVICE\x5CHARDDISKVOLUME1</path><creation>2019-04-28T22:00:18Z</creation><serial_number>84c53be0</serial_number><dirnames><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5C$EXTEND</dir><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS</dir><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSOFTWAREDISTRIBUTION</dir><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSOFTWAREDISTRIBUTION\x5CDOWNLOAD</dir><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSOFTWAREDISTRIBUTION\x5CDOWNLOAD\x5CINSTALL</dir><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32</dir><dir>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CTEMP</dir></dirnames></volume>"
#regx
find = re.findall("(?<=(<runs>))(\w|\d|\n|[().,\-:;@#$%^&*\[\]\"'+–/\/®°⁰!?{}|`~]| )+?(?=(</runs>))", text)
print(find)
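You were very close; it looks like the unescaped " inside the pattern string is what tripped you up. Also, although I don't know the details of your problem, I think the regex can be simplified. For example: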
import re
text ="<filename>MPSIGSTUB.EXE</filename><runs>0</runs>asdf<runs>1</runs>"
#regx
matches = re.finditer("<runs>(.*?)</runs>", text)
for match in matches:
    print(match.group(1))
# output:
# 0
# 1
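The same idea can be wrapped in a small reusable helper. This is only a sketch: the function name between_tags and the <note> element in the sample text are invented for illustration, and it assumes the tags carry no attributes and are not nested inside themselves. It uses re.findall with a single capture group, which returns just the captured strings, and re.DOTALL so that contents spanning several lines still match.

```python
import re

# Sample text; the <note> element is made up to show multi-line contents.
text = (
    "<filename>MPSIGSTUB.EXE</filename><runs>0</runs>asdf<runs>1</runs>"
    "<note>line one\nline two</note>"
)


def between_tags(tag, source):
    """Return the text found between <tag> and </tag>, in order of appearance."""
    # re.escape guards against tag names that contain regex metacharacters.
    pattern = r"<{0}>(.*?)</{0}>".format(re.escape(tag))
    # re.DOTALL lets "." match newlines, so multi-line contents are captured too.
    return re.findall(pattern, source, flags=re.DOTALL)


print(between_tags("runs", text))  # ['0', '1']
print(between_tags("note", text))  # ['line one\nline two']
```

Because the pattern has exactly one capture group, findall returns a flat list of strings; with several groups, as in the longer pattern above, it returns a list of tuples instead.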