Question

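I wrote the following simple HTML filter using requests and BeautifulSoup. It is supposed to take a list of allowed tags and attributes and remove every tag that is not in the list: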

from bs4 import NavigableString

def filter_tags(soup, tags_to_keep, allowed_attrs=None):

    for tag in soup.body.descendants:
        if tag.name in tags_to_keep:
            new_attrs = dict()
            for k,v in tag.attrs.items():
                if allowed_attrs and k in allowed_attrs:
                    new_attrs[k] = v
            tag.attrs = new_attrs

        elif isinstance(tag, NavigableString):
            continue

        else:
            # insert one whitespace char so words on either side of tag aren't concat'ed together
            tag.insert_before(" ")
            tag.decompose()

    return soup

I am calling the function like this: soup = filter_tags(soup, tags_to_keep=['html', 'body', 'div', 'a'], allowed_attrs=['href']).

The function appears to work on simple input like this:

<body>
    <div id="cats">
        This is a test
        <a href="http://www.google.com">Google!</a> 
        More text
    </div>
    <div>
        More text
        <script>...oh javascript...</script>
    </div>
</body>

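By "works" I mean that it correctly removes the <script> and <img> tags and the id= attribute, while keeping the specified tags and the href= attribute, so afterwards the markup looks like this: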

<body>
    <div>
        This is a test
        <a href="http://www.google.com">Google!</a> 
        More text
    </div>
    <div>
        More text
    </div>
</body>

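However, on more complex HTML it fails completely (an example of a page it fails on is http://www.cnn.com) and does not strip <script> tags, tag attributes, etc. from the HTML. I get output like this: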

import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('http://www.cnn.com').text, 'lxml')
filter_tags(soup, tags_to_keep=['body', 'a', 'div'], allowed_attrs=['href'])

              |
              |
              V    

... <li class="m-legal__list__item last-child">
<a class="m-legal__links" data-analytics="footer_cnn-newsource" href="http://cnnnewsource.com">CNN Newsource</a></li>
</ul></div></div></div></div></div></footer>
<div class="OUTBRAIN" data-ob-template="cnn" data-src="" data-widget-id="TR_1"></div>
 <script>(function (d) ...

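As you can see, on more complex HTML like this it did not remove anything, neither tags such as <script> nor attributes such as class=, but I can't work out why from my simple test, which seems to work fine...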

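What is wrong with my function above that prevents it from removing tags/attributes from complex HTML? My hunch is that it may have something to do with modifying the DOM tree with .decompose() while I am iterating over .descendants, but I'm not sure. If that is the problem, what alternative is there to the approach I am trying to use here?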
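
If the hunch about mutating the tree during iteration is right, one possible alternative is to take a snapshot of each node's children with list(...) before removing anything, so that no live iterator is ever invalidated. The sketch below is only an illustration under that assumption: filter_tags_snapshot and the sample string are hypothetical names and data, it uses the built-in "html.parser" instead of lxml, and it has not been checked against the CNN page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def filter_tags_snapshot(soup, tags_to_keep, allowed_attrs=None):
    """Hypothetical variant of filter_tags that never mutates a live iterator."""
    keep = set(tags_to_keep)
    allowed = set(allowed_attrs or ())
    stack = [soup.body]  # kept tags whose children still need checking
    while stack:
        node = stack.pop()
        # list(...) snapshots the children before any of them is removed
        for child in list(node.children):
            if not isinstance(child, Tag):
                continue  # text nodes, comments, etc. are left alone
            if child.name in keep:
                # keep only the whitelisted attributes, then visit its children later
                child.attrs = {k: v for k, v in child.attrs.items() if k in allowed}
                stack.append(child)
            else:
                child.insert_before(" ")  # keep neighbouring words apart
                child.decompose()  # drops the tag together with its whole subtree
    return soup


# Made-up sample input, loosely based on the simple test case above.
sample = (
    '<body><div id="cats">This is a test '
    '<a class="x" href="http://www.google.com">Google!</a>'
    '<script>var a = 1;</script> More text</div>'
    '<span><div id="inner">nested</div></span>'
    '<div>More text<script>...oh javascript...</script></div></body>'
)
cleaned = filter_tags_snapshot(BeautifulSoup(sample, "html.parser"), ["div", "a"], ["href"])
print(cleaned)
```

As in the original function, a disallowed tag is removed together with everything inside it, so an allowed <div> nested in a disallowed <span> disappears as well. The explicit stack is used instead of recursion only to avoid hitting Python's recursion limit on deeply nested pages.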

Why is my HTML sanitizer not removing tags?
