Stripping HTML elements

Time: 2018-09-28 19:39:03

Tags: beautifulsoup scrapy

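I'm trying to migrate a web forum where I have no control over the database etc., so I'm using Scrapy to pull the content together. It is based on an old phpBB 2.x forum. The markup is not very well structured, so there are some challenges.

I now have an HTML string and need to remove the surrounding <td></td> and <span></span> tags, as well as the "report" link at the bottom.

Starting from: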

<td colspan="2"><span class="postbody"></span>
    <table width="90%" cellspacing="1" cellpadding="3" border="0" align="center">
        <tr>
            <td><span class="genmed"><b>Some wrote :</b></span></td>
        </tr>
        <tr>
            <td class="quote">
                <table width="90%" cellspacing="1" cellpadding="3" border="0" align="center">
                    <tr>
                        <td><span class="genmed"><b>Another wrote:</b></span></td>
                    </tr>
                    <tr>
                        <td class="quote">Just for test
                            <a href="https://something.com">a link</a>
                        </td>
                    </tr>
                </table>
                <span class="postbody">
                    <br>

                    <br>
                    Test quote #1</span>
            </td>
        </tr>
    </table>
    <span class="postbody">
        <br>

        <br>
        Test quote #2<br>
        Another link: <a href="https://another.com">linktext</a><br>
        _________________<br>/ author
        <br>
        text<br>
        <div align="right">[ <a href="#" class="postlink" onclick="abuse('http://gt40.forum24.se/viewtopic.php?p=6537*6537&amp;forum=gt40');">Rapportera</a>
            ] </div>
    </span><span class="gensmall"></span>
</td>

Desired result:

<table width="90%" cellspacing="1" cellpadding="3" border="0" align="center">
    <tr>
        <td><span class="genmed"><b>Some wrote :</b></span></td>
    </tr>
    <tr>
        <td class="quote">
            <table width="90%" cellspacing="1" cellpadding="3" border="0" align="center">
                <tr>
                    <td><span class="genmed"><b>Another wrote:</b></span></td>
                </tr>
                <tr>
                    <td class="quote">Just for test
                        <a href="https://something.com">a link</a>
                    </td>
                </tr>
            </table>
            <span class="postbody">
                <br>

                <br>
                Test quote #1</span>
        </td>
    </tr>
</table>
<br>

<br>
Test quote #2<br>
Another link: <a href="https://another.com">linktext</a><br>
_________________<br>/ author
<br>
text<br>

Any hints?

1 answer:

Answer 0 (score: 0)

Why not simply do

html = html.removeprefix('<td colspan="2"><span class="postbody"></span>')  # Python 3.9+; str.strip() would treat its argument as a set of characters, not a prefix
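
Trimming a fixed string only deals with the opening wrapper; the closing tags, the empty spans and the "Rapportera" link would still be left behind. A parser-based approach may be more robust. The following is a minimal sketch with BeautifulSoup, not a tested solution for the whole forum: the function name `clean_post` is made up for illustration, and it assumes that every post has the shape shown in the question (one outer <td>, the report link inside a <div> with an `abuse(...)` onclick handler, and wrapper spans as direct children of that <td>).

```python
from bs4 import BeautifulSoup


def clean_post(html: str) -> str:
    """Return the inner HTML of the post cell without wrappers or report link.

    Hypothetical helper; assumes the markup layout shown in the question.
    """
    # html.parser keeps a bare <td> fragment as-is (no table context needed).
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.find("td")  # the outer wrapper cell comes first in the fragment
    if cell is None:
        return html

    # Drop the <div> that holds the report link (onclick="abuse(...)").
    for link in cell.find_all("a", class_="postlink", onclick=True):
        if link["onclick"].startswith("abuse("):
            container = link.find_parent("div")
            if container is not None:
                container.decompose()

    # Only look at spans that are direct children of the outer cell, so the
    # quoted <span class="postbody"> inside the nested table stays untouched.
    for span in cell.find_all("span", recursive=False):
        if span.get_text(strip=True) or span.find(True):
            span.unwrap()      # keep the content, drop the <span> itself
        else:
            span.decompose()   # empty wrapper, remove it entirely

    # Serialize the children of the <td>, which drops the <td> tag itself.
    return cell.decode_contents().strip()
```

Calling `clean_post(html)` on the string from the question should give markup equivalent to the desired result, although it will not be byte-identical: BeautifulSoup serializes void elements as `<br/>` and keeps the original whitespace between tags.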