BeautifulSoup decompose()

Asked: 2016-10-05 23:46:26

Tags: python python-3.x beautifulsoup

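I am trying to use BeautifulSoup to remove <script> tags and the content inside them. I went to the documentation, and it looks like a very simple function to call. The documentation has more information about the function. Here is the content of the HTML page I have parsed so far...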

<body class="pb-theme-normal pb-full-fluid">
    <div class="pub_300x250 pub_300x250m pub_728x90 text-ad textAd text_ad text_ads text-ads text-ad-links" id="wp-adb-c" style="width: 1px !important;
    height: 1px !important;
    position: absolute !important;
    left: -10000px !important;
    top: -1000px !important;
    ">
</div>
<div id="pb-f-a">
</div>
    <div class="" id="pb-root">
    <script>
    (function(a){
        TWP=window.TWP||{};
        TWP.Features=TWP.Features||{};
        TWP.Features.Page=TWP.Features.Page||{};
        TWP.Features.Page.PostRecommends={};
        TWP.Features.Page.PostRecommends.url="https://recommendation-hybrid.wpdigital.net/hybrid/hybrid-filter/hybrid.json?callback\x3d?";
        TWP.Features.Page.PostRecommends.trackUrl="https://recommendation-hybrid.wpdigital.net/hybrid/hybrid-filter/tracker.json?callback\x3d?";
        TWP.Features.Page.PostRecommends.profileUrl="https://usersegment.wpdigital.net/usersegments";
        TWP.Features.Page.PostRecommends.canonicalUrl=""
    })(jQuery);

    </script>
    </div>
</body>

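Imagine you have some web content like this, held in a BeautifulSoup object called soup_html. If I run soup_html.script.decompose() and then look at the soup_html object, the script tags are still there. How do I get rid of the <script> tags and the content inside them?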

from bs4 import BeautifulSoup

markup = 'The html above'
soup = BeautifulSoup(markup, 'html.parser')
html_body = soup.body

soup.script.decompose()

html_body

4 Answers:

Answer 0 (score: 6)

  

soup.script.decompose()

This removes only a single script element from the "soup": the first one. Instead, I think you meant to decompose all of them:

for script in soup("script"):
    script.decompose()
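
As a minimal sketch of the difference, the following uses a small made-up document with two <script> tags (the markup and the variable names remaining_after_first and remaining_after_loop are illustrative assumptions, not the page from the question). It counts how many script tags are left after the single call and after the loop.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with two <script> tags (not the page from the question).
markup = """
<body>
  <div id="pb-root">
    <script>var a = 1;</script>
    <p>Visible text</p>
    <script>var b = 2;</script>
  </div>
</body>
"""

soup = BeautifulSoup(markup, "html.parser")

# soup.script is only the first <script>; decomposing it leaves the second one.
soup.script.decompose()
remaining_after_first = len(soup.find_all("script"))

# Calling the soup object is shorthand for find_all(); this removes the rest.
for script in soup("script"):
    script.decompose()
remaining_after_loop = len(soup.find_all("script"))

print(remaining_after_first, remaining_after_loop)
```

If the real page contains more than one script tag, this would explain why some of them still show up after a single soup.script.decompose() call.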

Answer 1 (score: 1)

To elaborate on the answer provided by alecxe, here is a complete script for anyone's reference:

selects = soup.findAll('select')
for match in selects:
    match.decompose()

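Answer 2 (score: 0)

soup.script.decompose() would only remove it from the soup variable... not from the html_body variable. You would have to remove it from the html_body variable too. (I think.)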
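
One way to check this guess is the short sketch below, run on a made-up one-script document (the markup and the names body_has_script and same_node are assumptions for illustration). As far as I know, soup.body returns a Tag that lives inside the same tree rather than a copy, so a removal made through soup should also be visible through html_body.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with a single <script> tag.
markup = "<body><script>var a = 1;</script><p>Hello</p></body>"
soup = BeautifulSoup(markup, "html.parser")
html_body = soup.body  # a reference into the tree, taken before the removal

soup.script.decompose()

# Look for a <script> through the html_body variable after the removal.
body_has_script = html_body.find("script") is not None
# Check whether html_body and soup.body are the very same object.
same_node = html_body is soup.body

print(body_has_script, same_node)
```

If body_has_script comes out False, the two variables share one tree and no second removal is needed.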

Answer 3 (score: 0)

I was able to fix this with the following code...

import re

# Assumes soup, my_params and count are defined earlier in the script.
scripts = soup.findAll(['script', 'style'])
for match in scripts:
    match.decompose()
# Extract the text once, after every <script> and <style> tag is gone
file_content = soup.get_text()
# Replace non-ASCII characters with spaces
content = re.sub(r'[^\x00-\x7f]', r' ', file_content)
# Create the 'txt' file
with open(my_params['q'] + '_' + str(count) + '.txt', 'w+') as webpage_out:
    webpage_out.write(content)
    print('The file ' + my_params['q'] + '_' + str(count) + '.txt ' + 'has been created successfully.')
    count += 1

The error was that the with open(... block was part of the for match... loop.

The code that did not work...

scripts = soup.findAll(['script', 'style'])
for match in scripts:
    match.decompose()
    file_content = soup.get_text()
    # Replace non-ASCII characters with spaces
    content = re.sub(r'[^\x00-\x7f]', r' ', file_content)
    # Creating 'txt' files (the bug: this block sits inside the for loop,
    # so a file is written once per removed tag)
    with open(my_params['q'] + '_' + str(count) + '.txt', 'w+') as webpage_out:
        webpage_out.write(content)
        print('The file ' + my_params['q'] + '_' + str(count) + '.txt ' + 'has been created successfully.')
        count += 1
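
For anyone who wants to try the cleaning step from this answer in isolation, here is a self-contained sketch. The helper name visible_ascii_text and the sample markup are hypothetical, and the file writing is left out so the snippet has no side effects; it only removes the script and style tags, extracts the text once after the loop, and replaces non-ASCII characters with spaces.

```python
import re
from bs4 import BeautifulSoup


def visible_ascii_text(markup):
    """Return the page text without <script>/<style> content, ASCII only."""
    soup = BeautifulSoup(markup, "html.parser")
    # Remove every <script> and <style> tag together with its contents.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    # Extract the text once, after the loop has finished.
    text = soup.get_text()
    # Replace each non-ASCII character with a space.
    return re.sub(r"[^\x00-\x7f]", " ", text)


# Hypothetical sample page (not taken from the question).
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><script>var a = 1;</script><p>Caf\u00e9 menu</p></body></html>"
)
content = visible_ascii_text(sample)
print(content)
```

Keeping the extraction outside the loop means the text is computed a single time, which mirrors the fix described above.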