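I'm trying to migrate a web forum whose database I have no control over, so I'm scraping it with Scrapy. It runs on the old phpBB 2.x, and its markup is poorly structured, which poses some challenges.
I now have an HTML string and need to remove the surrounding <td></td>, the <span></span> wrappers, and the "report" link at the bottom.
Starting with: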
<td colspan="2"><span class="postbody"></span>
<table width="90%" cellspacing="1" cellpadding="3" border="0" align="center">
<tr>
<td><span class="genmed"><b>Some wrote :</b></span></td>
</tr>
<tr>
<td class="quote">
<table width="90%" cellspacing="1" cellpadding="3" border="0" align="center">
<tr>
<td><span class="genmed"><b>Another wrote:</b></span></td>
</tr>
<tr>
<td class="quote">Just for test
<a href="https://something.com">a link</a>
</td>
</tr>
</table>
<span class="postbody">
<br>
<br>
Test quote #1</span>
</td>
</tr>
</table>
<span class="postbody">
<br>
<br>
Test quote #2<br>
Another link: <a href="https://another.com">linktext</a><br>
_________________<br>/ author
<br>
text<br>
<div align="right">[ <a href="#" class="postlink" onclick="abuse('http://gt40.forum24.se/viewtopic.php?p=6537*6537&forum=gt40');">Rapportera</a>
] </div>
</span><span class="gensmall"></span>
</td>
Desired result:
<table width="90%" cellspacing="1" cellpadding="3" border="0" align="center">
<tr>
<td><span class="genmed"><b>Some wrote :</b></span></td>
</tr>
<tr>
<td class="quote">
<table width="90%" cellspacing="1" cellpadding="3" border="0" align="center">
<tr>
<td><span class="genmed"><b>Another wrote:</b></span></td>
</tr>
<tr>
<td class="quote">Just for test
<a href="https://something.com">a link</a>
</td>
</tr>
</table>
<span class="postbody">
<br>
<br>
Test quote #1</span>
</td>
</tr>
</table>
<br>
<br>
Test quote #2<br>
Another link: <a href="https://another.com">linktext</a><br>
_________________<br>/ author
<br>
text<br>
Any hints?
Answer 0 (score: 0)
Why not simply do
html = html.removeprefix('<td colspan="2"><span class="postbody"></span>')  # Python 3.9+; str.strip() would treat the argument as a set of characters, not a literal prefix
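If it helps, here is a rough sketch of how the whole cleanup could look using only the standard library. The function name `clean_post` and the regular expressions are my own assumptions, written against the sample post in the question; real phpBB 2.x pages may vary, and a proper HTML parser would be more robust than regexes.

```python
import re

MARKER = '<span class="postbody">'

# Matches the right-aligned "[ report ]" block whose link calls abuse(...).
REPORT_RE = re.compile(
    r'<div align="right">\s*\[\s*<a\b[^>]*onclick="abuse\([^>]*>.*?</a>'
    r'\s*\]\s*</div>\s*',
    re.DOTALL,
)


def clean_post(html: str) -> str:
    """Strip the wrapper markup from one post cell (hypothetical helper)."""
    html = html.strip()
    # Drop the opening <td ...> and the empty postbody span that follows it.
    html = re.sub(r'^<td[^>]*>\s*<span class="postbody">\s*</span>\s*', '', html)
    # Drop the trailing empty gensmall span and the closing </td>.
    html = re.sub(r'\s*<span class="gensmall">\s*</span>\s*</td>$', '', html)
    # Drop the report link block.
    html = REPORT_RE.sub('', html).rstrip()
    # Unwrap the last postbody span, which holds the post's own text.
    # Spans inside quoted tables come earlier and are left untouched.
    start = html.rfind(MARKER)
    if start != -1 and html.endswith('</span>'):
        inner = html[start + len(MARKER):-len('</span>')]
        html = html[:start] + inner.strip()
    return html
```

Calling `clean_post(html)` on the string from the question should leave the quote tables intact and return only the table plus the post text. Anchoring the first two patterns with `^` and `$` keeps them from touching cells inside nested quotes.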