Question

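I have a downloaded HTML page that is 300 MB in size. At that size it will not open. The file is made up of many entries, each of which holds data about a single file in a table. A sample entry: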

<div class="item">
      <h2>install.log</h2> 

<table>
        <col width="1*"/>
        <col width="2*"/>
        <tbody>


          <tr>
            <th>Local Copy:</th>
            <td>
              <a href="./Items/Test/Test/Test.L01/Sideways%123/D/Program%20Files/Mozilla%20Firefox/install.log">install</a>
            </td>
          </tr>



          <tr>
            <th>Name:</th>
            <td>
                                    install.log
                            </td>
          </tr>
          <tr>
            <th>Path Name:</th>
            <td>
                                    /Test/Test/Test.L01/Sideways123/D/Program Files/Mozilla Firefox
                            </td>
          </tr>
          <tr>
            <th>GUID:</th>
            <td>
                                    <tt>efaa12b1-e4b0-4ed8-9d14-b2dbf8d707fe</tt>
                            </td>
          </tr>
          <tr>
            <th>Item Date:</th>
            <td>
                                    Wednesday, 25 March 2009 15:14:39 o'clock GMT
                            </td>
          </tr>
          <tr>
            <th>File Created:</th>
            <td>
                                    Wednesday, 25 March 2009 15:14:36 o'clock GMT
                            </td>
          </tr>
          <tr>
            <th>File Modified:</th>
            <td>
                                    Wednesday, 25 March 2009 15:14:39 o'clock GMT
                            </td>
          </tr>



        </tbody>
      </table>

 </div>

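What I want to do is remove some of the table entries so that I can open the file in a browser, since many of the entries are irrelevant. In the sample above, each section starts with an <h2></h2> that contains the name of the file, including its extension. I would like to write a Python script that lets me list some file extensions (e.g. .log, .txt) and then edits the HTML page so that it keeps only the entries with those extensions; any other table entry should be removed together with all of its associated data. So for the code above, if I were only looking for .xls and .jpg files, the whole entry would be removed from the HTML.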

Any suggestions?
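One possible approach, given as a minimal sketch rather than a tested solution for the real file: stream the page line by line instead of loading all 300 MB into a parser, buffer each `<div class="item">` block until its closing `</div>`, and write the block out only if the name in its `<h2>` ends with one of the wanted extensions. This assumes every entry looks like the sample above, i.e. the opening `<div class="item">` starts on its own line and the `<h2>` holds just the file name. The function names `keep_item`, `filter_items`, and `filter_file` are hypothetical, not part of any library.

```python
import re

# Patterns based on the sample entry; adjust if the real markup differs.
ITEM_OPEN = re.compile(r'<div\s+class="item"\s*>')
DIV_OPEN = re.compile(r'<div\b')
DIV_CLOSE = re.compile(r'</div\s*>')
H2_TEXT = re.compile(r'<h2>\s*(.*?)\s*</h2>', re.S)


def keep_item(block, extensions):
    """Return True if the block's <h2> file name ends with a wanted extension."""
    match = H2_TEXT.search(block)
    if match is None:
        return True  # no <h2> found: keep the block rather than guess
    return match.group(1).lower().endswith(extensions)


def filter_items(lines, extensions):
    """Yield the input text, dropping item blocks with unwanted extensions."""
    extensions = tuple(ext.lower() for ext in extensions)
    buffer = []
    depth = 0  # nesting depth of <div> inside the current item block
    for line in lines:
        if depth == 0 and not ITEM_OPEN.search(line):
            yield line  # outside any item block: pass through unchanged
            continue
        buffer.append(line)
        depth += len(DIV_OPEN.findall(line)) - len(DIV_CLOSE.findall(line))
        if depth <= 0:  # the item block has been closed
            block = "".join(buffer)
            if keep_item(block, extensions):
                yield block
            buffer = []
            depth = 0
    if buffer:  # unterminated block at end of file: emit it as is
        yield "".join(buffer)


def filter_file(src, dst, extensions):
    """Copy src to dst, keeping only the entries with the given extensions."""
    with open(src, encoding="utf-8", errors="replace") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        fout.writelines(filter_items(fin, extensions))
```

It could then be called as `filter_file("report.html", "report_filtered.html", [".xls", ".jpg"])`, where both file names are placeholders. Because only one entry is held in memory at a time, the size of the input should not matter much. If the real markup is less regular than the sample, a proper incremental HTML parser would be the safer choice.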

Parsing and removing table entries from HTML using Python

0 answers: