Parsing XML with LXML in Python

Date: 2018-04-21 18:18:47

Tags: python xml sqlite iterator lxml

I am trying to parse a huge XML file that contains a list of Directories_name, Files_name, permissions, creation_date_time, username, file_size and dir_size.

The XML is structured as follows:

<parent_dir>
   <file_1>
   <sub-dir_1>
           <file_2>
            <sub-dir_2>
                ....
                  .....

The actual file looks like this:

<browse path="">
  <dir date="2018-04-17 23:31:59" internal="0" group="TrustedInstaller" protection="drwxrwxrwx" name="C:" size="593181949" links="0" user="unknown">
    <dir date="2017-12-13 23:30:44" internal="0" group="unknown" protection="drwxrwxr-x" name="Documents and Settings" size="174" links="0" user="SYSTEM">
      <file date="2017-03-18 22:01:11" internal="0" group="unknown" protection="-rwxrwxr-x" name="desktop.ini" size="174" links="0" user="Administrators" />
    </dir>
    <dir date="2017-12-14 03:17:04" internal="0" group="None" protection="d--x------" name="Test_Software" size="516708762" links="0" user="srt">
      <file date="2017-02-09 14:58:53" internal="0" group="None" protection="----------" name="26.avi" size="13263184" links="0" user="srt" />
      <file date="2016-11-01 00:31:40" internal="0" group="None" protection="----------" name="6.avi" size="13569536" links="0" user="srt" />
      <dir date="2017-12-13 23:41:27" internal="0" group="None" protection="d--x------" name=".vs" size="5120" links="0" user="srt">
        <dir date="2017-12-13 23:41:27" internal="0" group="None" protection="d--x------" name="Forest_Protector" size="5120" links="0" user="srt">
          <dir date="2017-12-13 23:41:27" internal="0" group="None" protection="d--x------" name="v14" size="5120" links="0" user="srt">
            <file date="2017-12-12 14:35:36" internal="0" group="None" protection="----------" name=".suo" size="5120" links="0" user="srt" />
          </dir>
        </dir>
      </dir>
      <dir date="2017-12-14 03:17:15" internal="0" group="None" protection="d--x------" name="Debug" size="379090369" links="0" user="srt">
        <file date="2017-12-14 03:06:03" internal="0" group="None" protection="-rwx------" name="Current_Frame1 (2).mp4" size="321612800" links="0" user="Administrators" />
        <file date="2018-04-16 21:35:17" internal="0" group="None" protection="-rwx------" name="Current_Frame1.avi" size="94102" links="0" user="Administrators" />
        <dir date="2017-12-14 03:17:20" internal="0" group="None" protection="d--x------" name="Fire" size="7502723" links="0" user="srt">
          <file date="2017-12-12 21:35:13" internal="0" group="None" protection="----------" name="Fire_Detected_02_05_12.bmp" size="921654" links="0" user="srt" />
          <file date="2017-12-12 21:35:13" internal="0" group="None" protection="----------" name="Fire_Detected_02_05_13.bmp" size="921654" links="0" user="srt" />
        </dir>
        <dir date="2017-12-13 23:41:28" internal="0" group="None" protection="d--x------" name="Smoke" size="3686616" links="0" user="srt">
          <file date="2017-12-12 21:35:50" internal="0" group="None" protection="----------" name="Smoke_Detected_02_05_50.bmp" size="921654" links="0" user="srt" />
          <file date="2017-12-12 21:39:17" internal="0" group="None" protection="----------" name="Smoke_Detected_02_09_17.bmp" size="921654" links="0" user="srt" />
          <file date="2017-12-12 21:39:18" internal="0" group="None" protection="----------" name="Smoke_Detected_02_09_18.bmp" size="921654" links="0" user="srt" />
          <file date="2017-12-12 21:42:29" internal="0" group="None" protection="----------" name="Smoke_Detected_02_12_29.bmp" size="921654" links="0" user="srt" />
        </dir>
      </dir>
      <dir date="2017-12-13 23:41:31" internal="0" group="None" protection="d--x------" name="Scripts" size="25590875" links="0" user="srt">
        <file date="2016-12-18 04:57:14" internal="0" group="None" protection="----------" name="_hashlib.pyd" size="1482240" links="0" user="srt" />
        <file date="2016-12-18 04:56:16" internal="0" group="None" protection="----------" name="_socket.pyd" size="50688" links="0" user="srt" />
        <file date="2016-12-18 04:56:54" internal="0" group="None" protection="----------" name="_ssl.pyd" size="2100736" links="0" user="srt" />
        <dir date="2017-12-13 23:41:28" internal="0" group="None" protection="d--x------" name="build" size="2248287" links="0" user="srt">
          <dir date="2017-12-13 23:41:28" internal="0" group="None" protection="d--x------" name="bdist.win-amd64" size="2248287" links="0" user="srt">
            <dir date="2017-12-13 23:41:31" internal="0" group="None" protection="d--x------" name="winexe" size="2248287" links="0" user="srt">
              <dir date="2017-12-13 22:46:02" internal="0" group="None" protection="d--x------" name="bundle-2.7" size="0" links="0" user="srt" />
              <dir date="2017-12-13 23:41:31" internal="0" group="None" protection="d--x------" name="collect-2.7" size="2246148" links="0" user="srt">
                <file date="2017-12-13 14:41:41" internal="0" group="None" protection="----------" name="__future__.pyc" size="4103" links="0" user="srt" />
                <file date="2017-12-13 14:41:41" internal="0" group="None" protection="----------" name="_abcoll.pyc" size="23604" links="0" user="srt" />
                <dir date="2017-12-13 23:41:29" internal="0" group="None" protection="d--x------" name="email" size="125408" links="0" user="srt">
                  <file date="2017-12-13 14:41:42" internal="0" group="None" protection="----------" name="__init__.pyc" size="2752" links="0" user="srt" />
                  <dir date="2017-12-13 23:41:29" internal="0" group="None" protection="d--x------" name="mime" size="110" links="0" user="srt">
                    <file date="2017-12-13 14:41:42" internal="0" group="None" protection="----------" name="__init__.pyc" size="110" links="0" user="srt" />
                  </dir>
                </dir>
                <dir date="2017-12-13 23:41:30" internal="0" group="None" protection="d--x------" name="encodings" size="413685" links="0" user="srt">
                  <file date="2017-12-13 14:41:42" internal="0" group="None" protection="----------" name="__init__.pyc" size="4298" links="0" user="srt" />
                  <file date="2017-12-13 14:41:42" internal="0" group="None" protection="----------" name="aliases.pyc" size="8750" links="0" user="srt" />
                </dir>
                <dir date="2017-12-13 23:41:30" internal="0" group="None" protection="d--x------" name="json" size="41094" links="0" user="srt">
                  <file date="2017-12-13 14:41:42" internal="0" group="None" protection="----------" name="__init__.pyc" size="13824" links="0" user="srt" />
                  <file date="2017-12-13 14:41:42" internal="0" group="None" protection="----------" name="decoder.pyc" size="11720" links="0" user="srt" />
                  <file date="2017-12-13 14:41:42" internal="0" group="None" protection="----------" name="encoder.pyc" size="13381" links="0" user="srt" />
                  <file date="2017-12-13 14:41:42" internal="0" group="None" protection="----------" name="scanner.pyc" size="2169" links="0" user="srt" />
                </dir>
                <dir date="2017-12-13 23:41:31" internal="0" group="None" protection="d--x------" name="logging" size="55152" links="0" user="srt">
                  <file date="2017-12-13 14:41:42" internal="0" group="None" protection="----------" name="__init__.pyc" size="55152" links="0" user="srt" />
                </dir>
                <dir date="2017-12-13 23:41:31" internal="0" group="None" protection="d--x------" name="python_http_client" size="12280" links="0" user="srt">
                  <file date="2017-12-13 14:41:43" internal="0" group="None" protection="----------" name="__init__.pyc" size="611" links="0" user="srt" />
                  <file date="2017-12-13 14:41:43" internal="0" group="None" protection="----------" name="client.pyc" size="8422" links="0" user="srt" />
                  <file date="2017-12-13 14:41:43" internal="0" group="None" protection="----------" name="exceptions.pyc" size="3247" links="0" user="srt" />
                </dir>
                <dir date="2017-12-13 23:41:31" internal="0" group="None" protection="d--x------" name="sendgrid" size="46335" links="0" user="srt">
                  <file date="2017-12-13 14:41:43" internal="0" group="None" protection="----------" name="__init__.pyc" size="290" links="0" user="srt" />
                  <file date="2017-12-13 14:41:43" internal="0" group="None" protection="----------" name="sendgrid.pyc" size="2552" links="0" user="srt" />
                  <file date="2017-12-13 14:41:43" internal="0" group="None" protection="----------" name="version.pyc" size="369" links="0" user="srt" />
                  <dir date="2017-12-13 23:41:31" internal="0" group="None" protection="d--x------" name="helpers" size="43124" links="0" user="srt">
                    <file date="2017-12-13 14:41:43" internal="0" group="None" protection="----------" name="__init__.pyc" size="116" links="0" user="srt" />
                    <dir date="2017-12-13 23:41:31" internal="0" group="None" protection="d--x------" name="mail" size="43008" links="0" user="srt">
                      <file date="2017-12-13 14:41:43" internal="0" group="None" protection="----------" name="__init__.pyc" size="156" links="0" user="srt" />
                      <file date="2017-12-13 14:41:43" internal="0" group="None" protection="----------" name="mail.pyc" size="42852" links="0" user="srt" />
                    </dir>
                  </dir>
                </dir>
                <dir date="2017-12-13 23:41:31" internal="0" group="None" protection="d--x------" name="unittest" size="91680" links="0" user="srt">
                  <file date="2017-12-13 14:41:43" internal="0" group="None" protection="----------" name="__init__.pyc" size="2944" links="0" user="srt" />

                </dir>
              </dir>
              <dir date="2017-12-13 23:41:31" internal="0" group="None" protection="d--x------" name="temp" size="2139" links="0" user="srt">
                <file date="2017-12-13 14:41:41" internal="0" group="None" protection="----------" name="_hashlib.py" size="358" links="0" user="srt" />
              </dir>
            </dir>
          </dir>
        </dir>
        <dir date="2017-12-13 23:41:31" internal="0" group="None" protection="d--x------" name="dist" size="11668574" links="0" user="srt">
          <file date="2016-12-18 04:57:14" internal="0" group="None" protection="----------" name="_hashlib.pyd" size="1482240" links="0" user="srt" />
          <file date="2016-12-18 04:56:16" internal="0" group="None" protection="----------" name="_socket.pyd" size="50688" links="0" user="srt" />
        </dir>
      </dir>
    </dir>
    <dir date="2018-02-05 19:41:24" internal="0" group="TrustedInstaller" protection="drwxrwx---" name="Windows" size="76473013" links="0" user="unknown">
      <dir date="2018-04-17 22:11:40" internal="0" group="TrustedInstaller" protection="drwxrwx---" name="System32" size="76473013" links="0" user="unknown">
        <dir date="2017-12-14 00:11:42" internal="0" group="TrustedInstaller" protection="drwxrwx---" name="drivers" size="76473013" links="0" user="unknown">
          <file date="2017-03-18 21:56:34" internal="0" group="TrustedInstaller" protection="-rwxrwx---" name="BtaMPM.sys" size="23552" links="0" user="unknown" />
          <file date="2017-03-18 21:56:19" internal="0" group="TrustedInstaller" protection="-rwxrwx---" name="BthAvrcpTg.sys" size="43520" links="0" user="unknown" />
          <dir date="2017-03-19 03:49:53" internal="0" group="TrustedInstaller" protection="drwxrwx---" name="en-US" size="1423360" links="0" user="unknown">
            <file date="2017-03-18 06:45:38" internal="0" group="TrustedInstaller" protection="-rwxrwx---" name="hidbth.sys.mui" size="5120" links="0" user="unknown" />
            <file date="2017-03-18 06:45:50" internal="0" group="TrustedInstaller" protection="-rwxrwx---" name="hidclass.sys.mui" size="6656" links="0" user="unknown" />
          </dir>
          <dir date="2017-03-18 22:03:39" internal="0" group="TrustedInstaller" protection="drwxrwx---" name="etc" size="23907" links="0" user="unknown">
            <file date="2017-03-18 22:01:13" internal="0" group="unknown" protection="-rwxrwx---" name="hosts" size="824" links="0" user="SYSTEM" />
            <file date="2017-03-18 22:01:13" internal="0" group="unknown" protection="-rwxrwx---" name="lmhosts.sam" size="3683" links="0" user="SYSTEM" />
            <file date="2017-03-18 22:01:13" internal="0" group="unknown" protection="-rwxrwx---" name="networks" size="407" links="0" user="SYSTEM" />
            <file date="2017-03-18 22:01:13" internal="0" group="unknown" protection="-rwxrwx---" name="protocol" size="1358" links="0" user="SYSTEM" />
            <file date="2017-03-18 22:01:13" internal="0" group="unknown" protection="-rwxrwx---" name="services" size="17635" links="0" user="SYSTEM" />
          </dir>
          <dir date="2017-07-11 06:41:34" internal="0" group="TrustedInstaller" protection="drwxrwx---" name="UMDF" size="1743264" links="0" user="unknown">
            <file date="2017-03-18 21:56:19" internal="0" group="TrustedInstaller" protection="-rwxrwx---" name="EhStorPwdDrv.dll" size="85504" links="0" user="unknown" />
            <file date="2017-07-11 06:40:08" internal="0" group="TrustedInstaller" protection="-rwxrwx---" name="NfcCx.dll" size="710656" links="0" user="unknown" />
            <dir date="2017-03-19 03:47:47" internal="0" group="TrustedInstaller" protection="drwxrwx---" name="en-US" size="66048" links="0" user="unknown">
              <file date="2017-03-18 06:47:42" internal="0" group="TrustedInstaller" protection="-rwxrwx---" name="hidscanner.dll.mui" size="2560" links="0" user="unknown" />
              <file date="2017-03-18 06:47:42" internal="0" group="TrustedInstaller" protection="-rwxrwx---" name="IddCx.dll.mui" size="7168" links="0" user="unknown" />
            </dir>
          </dir>
        </dir>
      </dir>
    </dir>
  </dir>
</browse>

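Now, from this XML, I want to get the following:

1) Parse almost everything into the following format:

Parent_DIR: in this case, C:

File-to-DIR relationship in SQLite: the folder 'Test_Software', which resides in C:, contains 26.avi, 6.avi and so on. For each folder, list the files in that folder and store that information in the DIR_INFO table of the DB, with the DIR name in the DIR_Name column and the file list in the File_Name column.

2) Get a list of all files together with their parent paths. Note: take the file named "__future__.pyc", whose original path is C:/Test_Software/Scripts/build/bdist.win-amd64/winexe/collect-2.7/__future__.pyc. I want to put the file name __future__.pyc in the File_name column of the DB table File_info, and its full path in the file_full_path column of the same table.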
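To illustrate requirement 1, here is a minimal sketch of one way to pair each directory with the files that sit directly inside it. It assumes the element and attribute names from the sample file (`dir`, `file`, `name`); the `files_by_dir` helper and the trimmed `SAMPLE` document are hypothetical, and the `ImportError` fallback is there only so the sketch still runs where lxml is not installed.

```python
try:
    from lxml import etree              # the library used in the question
except ImportError:                     # assumption: fall back to the stdlib parser
    import xml.etree.ElementTree as etree

# Hypothetical, trimmed-down version of the file shown in the question.
SAMPLE = """<browse path="">
  <dir name="C:" size="0">
    <dir name="Test_Software" size="0">
      <file name="26.avi" size="13263184" />
      <file name="6.avi" size="13569536" />
      <dir name="Debug" size="0">
        <file name="Current_Frame1.avi" size="94102" />
      </dir>
    </dir>
  </dir>
</browse>"""


def files_by_dir(root):
    """Return a list of (dir_name, [file_name, ...]) pairs, one per <dir>."""
    pairs = []
    # iter("dir") visits every <dir> at any depth, in document order.
    for d in root.iter("dir"):
        # findall("file") returns only the direct <file> children of this <dir>.
        pairs.append((d.get("name"), [f.get("name") for f in d.findall("file")]))
    return pairs


root = etree.fromstring(SAMPLE)
for dir_name, file_names in files_by_dir(root):
    print(dir_name, file_names)
```

A list of pairs is used rather than a dictionary because directory names are not unique in the sample (for example, `en-US` appears twice), so keying on the name alone would lose entries.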

So far, I have been able to get the File_info and Dir_info using the following code:

from lxml import etree

# single_bk is assumed to hold the XML document shown above (str or bytes).
xml_out = etree.fromstring(single_bk)
# These absolute paths match only the top-level <dir> and its direct <file> children.
file_info = xml_out.xpath('/browse/dir/file')  # Files Path
dir_info = xml_out.xpath('/browse/dir')  # Parent Path
dir_names = []
file_names = []
file_dates = []
dir_dates = []
dir_group = []
file_group = []
file_permissions = []
dir_permissions = []
file_size = []
dir_size = []
file_user = []
dir_user = []

# Collect one attribute per list for every matched <file>.
for node in file_info:
    file_names.append(node.xpath("@name")[0])
    file_dates.append(node.xpath("@date")[0])
    file_group.append(node.xpath("@group")[0])
    file_permissions.append(node.xpath("@protection")[0])
    file_size.append(node.xpath("@size")[0])
    file_user.append(node.xpath("@user")[0])

# Same for every matched <dir>.
for node in dir_info:
    dir_names.append(node.xpath("@name")[0])
    dir_dates.append(node.xpath("@date")[0])
    dir_group.append(node.xpath("@group")[0])
    dir_permissions.append(node.xpath("@protection")[0])
    dir_size.append(node.xpath("@size")[0])
    dir_user.append(node.xpath("@user")[0])
print(file_names, file_dates, file_group, file_permissions, file_size, file_user)
print('\n------------------------------------------------------------------\n')
print(dir_names, dir_dates, dir_group, dir_permissions, dir_size, dir_user)
print('\n------------------------------------------------------------------\n')
list_of_attributes = []
for node in dir_info:
    attrs = []
    for att in node.attrib:
        #attrs.append(("@" + att, node.attrib[att]))
        attrs.append(node.attrib[att])
    # Append once per <dir>, after all of its attribute values are collected.
    list_of_attributes.append(attrs)
print(list_of_attributes)
print('\n------------------------------------------------------------------\n')
print(attrs)

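However, it has the following limitations:

1) I cannot map the whole XML with this approach, because the XML can be nested to any depth, and it seems I would need an endless loop to cover everything in it.

2) I cannot map the file -> dir relationship, because mapping a file back to its outermost parent in this XML seems too complicated. For example, when I read the file "__future__.pyc", whose original path is C:/Test_Software/Scripts/build/bdist.win-amd64/winexe/collect-2.7/__future__.pyc, I cannot think of a way to walk back up to C:\Test_Software and build the full file_path.

3) I want to store the dir information in the SQLite DIR_INFO table and the file information in the FILE_INFO table.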
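Limitations 1 and 2 can both be approached with a recursive walk that carries the parent path down the tree, so nothing has to be traced back upwards. The following is only a sketch under assumptions: the `walk` generator and the trimmed `SAMPLE` document are hypothetical, and `/` is assumed as the path separator, as in the example path above. With lxml specifically, `getparent()` or `iterancestors()` on an element would be another way to climb from a file to its enclosing directories.

```python
try:
    from lxml import etree              # the library used in the question
except ImportError:                     # assumption: fall back to the stdlib parser
    import xml.etree.ElementTree as etree

# Hypothetical, trimmed-down version of the file shown in the question.
SAMPLE = """<browse path="">
  <dir name="C:">
    <dir name="Test_Software">
      <file name="26.avi" />
      <dir name="Scripts">
        <dir name="build">
          <dir name="bdist.win-amd64">
            <dir name="winexe">
              <dir name="collect-2.7">
                <file name="__future__.pyc" />
              </dir>
            </dir>
          </dir>
        </dir>
      </dir>
    </dir>
  </dir>
</browse>"""


def walk(elem, parent_path=""):
    """Yield (kind, name, full_path, parent_path) for every <dir>/<file> below elem."""
    for child in elem:
        if child.tag not in ("dir", "file"):
            continue  # skip comments or unrelated elements
        name = child.get("name")
        full_path = f"{parent_path}/{name}" if parent_path else name
        yield child.tag, name, full_path, parent_path
        if child.tag == "dir":
            # Recurse with the path built so far; depth is limited only by the tree.
            yield from walk(child, full_path)


root = etree.fromstring(SAMPLE)
for kind, name, full_path, parent_path in walk(root):
    print(kind, full_path)
```

Because the recursion follows the nesting of the document, no fixed number of loops is needed. Python's default recursion limit would only become a concern for trees nested hundreds of levels deep.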

Here is my SQLite table information:

Database name: abc.db. Tables: a) DIR_INFO b) FILE_INFO

Columns of DIR_INFO Table
ID, DIR_DATE, DIR_NAME, DIR_FILE, DIR_SIZE , DIR_PERMISSIONS, DIR_USER

Columns of FILE_INFO Table
ID, FILE_DATE, FILE_NAME, FILE_PARENT_DIR, FILE_FULL_PATH, FILE_SIZE , FILE_PERMISSIONS, FILE_USER
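A possible way to fill these two tables is sketched below with the standard `sqlite3` module. Several things here are assumptions rather than part of the question: the column types, the use of `|` to join the file list stored in DIR_FILE (that character is not allowed in Windows file names), storing the immediate parent's name in FILE_PARENT_DIR, and the in-memory database used in place of abc.db. The `create_tables` and `load` helpers and the trimmed `SAMPLE` document are hypothetical.

```python
import sqlite3

try:
    from lxml import etree              # the library used in the question
except ImportError:                     # assumption: fall back to the stdlib parser
    import xml.etree.ElementTree as etree

# Hypothetical, trimmed-down version of the file shown in the question.
SAMPLE = """<browse path="">
  <dir date="2018-04-17 23:31:59" protection="drwxrwxrwx" name="C:" size="593181949" user="unknown">
    <dir date="2017-12-14 03:17:04" protection="d--x------" name="Test_Software" size="516708762" user="srt">
      <file date="2017-02-09 14:58:53" protection="----------" name="26.avi" size="13263184" user="srt" />
      <file date="2016-11-01 00:31:40" protection="----------" name="6.avi" size="13569536" user="srt" />
    </dir>
  </dir>
</browse>"""


def create_tables(conn):
    # Column names follow the question; the types are assumptions.
    conn.execute("""CREATE TABLE IF NOT EXISTS DIR_INFO (
        ID INTEGER PRIMARY KEY, DIR_DATE TEXT, DIR_NAME TEXT, DIR_FILE TEXT,
        DIR_SIZE INTEGER, DIR_PERMISSIONS TEXT, DIR_USER TEXT)""")
    conn.execute("""CREATE TABLE IF NOT EXISTS FILE_INFO (
        ID INTEGER PRIMARY KEY, FILE_DATE TEXT, FILE_NAME TEXT, FILE_PARENT_DIR TEXT,
        FILE_FULL_PATH TEXT, FILE_SIZE INTEGER, FILE_PERMISSIONS TEXT, FILE_USER TEXT)""")


def load(conn, elem, parent_path=""):
    """Insert one row per <dir> and per <file> found below elem."""
    for child in elem:
        if child.tag not in ("dir", "file"):
            continue
        a = child.attrib
        name = a.get("name")
        path = f"{parent_path}/{name}" if parent_path else name
        if child.tag == "dir":
            # Assumption: the file list is stored as one "|"-joined string.
            files = "|".join(f.get("name") for f in child.findall("file"))
            conn.execute(
                "INSERT INTO DIR_INFO (DIR_DATE, DIR_NAME, DIR_FILE, DIR_SIZE,"
                " DIR_PERMISSIONS, DIR_USER) VALUES (?, ?, ?, ?, ?, ?)",
                (a.get("date"), name, files, int(a.get("size", 0)),
                 a.get("protection"), a.get("user")))
            load(conn, child, path)
        else:
            conn.execute(
                "INSERT INTO FILE_INFO (FILE_DATE, FILE_NAME, FILE_PARENT_DIR,"
                " FILE_FULL_PATH, FILE_SIZE, FILE_PERMISSIONS, FILE_USER)"
                " VALUES (?, ?, ?, ?, ?, ?, ?)",
                (a.get("date"), name, elem.get("name"), path,
                 int(a.get("size", 0)), a.get("protection"), a.get("user")))


conn = sqlite3.connect(":memory:")      # the question's database would be abc.db
create_tables(conn)
load(conn, etree.fromstring(SAMPLE))
conn.commit()
for row in conn.execute("SELECT FILE_NAME, FILE_PARENT_DIR, FILE_FULL_PATH FROM FILE_INFO"):
    print(row)
```

The `?` placeholders let `sqlite3` handle quoting, which matters for names such as "Current_Frame1 (2).mp4" or "Documents and Settings".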

If you have read my whole request this far, first, thank you for your time, and second, any help would be greatly appreciated.

1 Answer:

Answer 0: (score: 0)

I solved this by using regular expressions instead of the LXML parser. For details, see here: Parse Output for Python
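For readers who would rather stay with an XML parser than use regular expressions, one option for a very large file is a streaming parse with `iterparse`, which avoids building the whole tree in memory. This is a sketch of that alternative, not the approach the answer describes: the `stream_files` generator and the trimmed `SAMPLE` document are hypothetical, and a stack of open directory names is assumed to be enough to rebuild each file's path.

```python
import io

try:
    from lxml import etree              # the library used in the question
except ImportError:                     # assumption: fall back to the stdlib parser
    import xml.etree.ElementTree as etree

# Hypothetical, trimmed-down version of the file shown in the question.
SAMPLE = """<browse path="">
  <dir name="C:">
    <dir name="Test_Software">
      <file name="26.avi" />
      <dir name="Debug">
        <file name="Current_Frame1.avi" />
      </dir>
    </dir>
  </dir>
</browse>"""


def stream_files(source):
    """Yield (file_name, full_path) while reading the XML incrementally."""
    stack = []  # names of the <dir> elements that are currently open
    for event, elem in etree.iterparse(source, events=("start", "end")):
        if elem.tag == "dir":
            if event == "start":
                stack.append(elem.get("name"))
            else:
                stack.pop()
                elem.clear()  # drop the finished subtree's children to save memory
        elif elem.tag == "file" and event == "end":
            yield elem.get("name"), "/".join(stack + [elem.get("name")])


# A real run would pass the path of the XML file instead of an in-memory buffer.
for name, full_path in stream_files(io.BytesIO(SAMPLE.encode("utf-8"))):
    print(name, full_path)
```

Each directory is cleared as soon as its closing tag is seen, so memory use depends mainly on the size of the largest single directory rather than on the whole document.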