Question

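I automatically translate the content of HTML pages into different languages, so I have to extract all the text nodes from various HTML pages, which are sometimes badly written (and I cannot edit that HTML).

With BeautifulSoup I can easily extract these texts and replace them with the translation, but when I output the HTML after these operations (html = BeautifulSoup(source_html)), it is sometimes broken, because BeautifulSoup automatically closes tags (for example, a table tag gets closed in the wrong place).

Is there a way to stop BeautifulSoup from closing these tags?

For example, this is my input:

html = "<table><tr><td>some text</td></table>" (the closing </tr> is missing)

After soup = BeautifulSoup(html) I get "<table><tr><td>some text</td></tr></table>"

I would like to get back exactly the same HTML as the input...

Is that possible?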

Answer 1

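BeautifulSoup is good at parsing and extracting data from malformed HTML/XML, but when the broken HTML is ambiguous it applies a set of rules to interpret the tags (which may not give the result you want). See the Parsing HTML section of the documentation, which has an example very similar to your situation.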
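
To make the idea of such a rule concrete, here is a deliberately simplified model of one, written with Python 3's standard `html.parser` instead of BeautifulSoup. It is only a sketch and not BeautifulSoup's actual algorithm: `ImplicitCloseDemo` and `implied_closes` are hypothetical names, and void elements such as `<br>` or `<input>` are not special-cased. It only illustrates why an end tag for an outer element forces the still-open inner elements to be closed.

```python
from html.parser import HTMLParser


class ImplicitCloseDemo(HTMLParser):
    """Toy model: track open tags and record which ones get closed implicitly."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open elements
        self.implied = []    # elements closed without their own end tag

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag: ignore it in this toy model
        # Pop until the matching element; everything above it was left open.
        while self.open_tags:
            top = self.open_tags.pop()
            if top == tag:
                break
            self.implied.append(top)


def implied_closes(html):
    """Return the tags that a stack-based parser would have to close itself."""
    demo = ImplicitCloseDemo()
    demo.feed(html)
    demo.close()
    return demo.implied
```

For the input from the question, `implied_closes("<table><tr><td>some text</td></table>")` reports `tr`, which corresponds to the `</tr>` that shows up in the re-serialized output.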

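If you know what is wrong with your markup and understand the rules BeautifulSoup uses, you can tweak the HTML slightly (perhaps by removing or moving certain tags) so that BeautifulSoup returns the output you want.

If you can post a short example, someone may be able to give you more specific help.

Update (some examples)

For instance, consider the example given in the documentation (the Parsing HTML section mentioned above):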

# Written for BeautifulSoup 3 on Python 2 (note the print statement below).
from BeautifulSoup import BeautifulSoup
html = """
<html>
<form>
 <table>
 <td><input name="input1">Row 1 cell 1
 <tr><td>Row 2 cell 1
 </form> 
 <td>Row 2 cell 2<br>This</br> sure is a long cell
</body> 
</html>"""
print BeautifulSoup(html).prettify()
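
The <table> tag gets closed before </form> so that the table is properly nested inside the form, which leaves the last <td> hanging.

If we understand the problem, we can get the closing tag (</table>) in the right place by removing "<form>" before parsing: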

>>> html = html.replace("<form>", "")
>>> soup = BeautifulSoup(html)
>>> print soup.prettify()
<html>
 <table>
  <td>
   <input name="input1" />
   Row 1 cell 1
  </td>
  <tr>
   <td>
    Row 2 cell 1
   </td>
   <td>
    Row 2 cell 2
    <br />
    This
    sure is a long cell
   </td>
  </tr>
 </table>
</html>

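The `html.replace("<form>", "")` call above only matches the exact string `<form>`; it would miss an opening tag with attributes and it leaves the stray `</form>` for the parser to deal with. A slightly more general pre-processing step could look like the sketch below. `strip_tag` is a hypothetical helper, not part of BeautifulSoup, and a regular expression like this is only a rough tool for known, simple cases (it does not cope with `>` inside attribute values, for example).

```python
import re


def strip_tag(html, name):
    """Remove the opening and closing tags for `name`, keeping their content."""
    # Matches <name ...> and </name>, case-insensitively; \b stops <formula>
    # from being treated as <form>.
    pattern = re.compile(r"</?%s\b[^>]*>" % re.escape(name), re.IGNORECASE)
    return pattern.sub("", html)
```

For example, `strip_tag('<form action="x"><table></form></table>', "form")` leaves only the table markup, which can then be handed to the parser.
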
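If the <form> tag matters, you can still add it back after parsing. For example:

>>> from BeautifulSoup import Tag  # Tag must be imported (BeautifulSoup 3)
>>> new_form = Tag(soup, "form")  # create form element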
>>> soup.html.insert(0, new_form)  # insert form as child of html
>>> new_form.insert(0, soup.table.extract()) # move table into form
>>> print soup.prettify()
<html>
 <form>
  <table>
   <td>
    <input name="input1" />
    Row 1 cell 1
   </td>
   <tr>
    <td>
     Row 2 cell 1
    </td>
    <td>
     Row 2 cell 2
     <br />
     This
     sure is a long cell
    </td>
   </tr>
  </table>
 </form>
</html>

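If the goal is only to replace text and get every tag back exactly as it was written, a different technique from the one in this answer is to skip tree building altogether and echo the markup token by token. The sketch below uses Python 3's standard `html.parser` rather than BeautifulSoup; `TextTranslator` and `translate_html` are hypothetical names, and `translate` stands for whatever translation function you already have. Start tags are reproduced verbatim, while end tags are rebuilt from the lowercased tag name, so unusual casing or spacing inside end tags is normalized.

```python
from html.parser import HTMLParser


class TextTranslator(HTMLParser):
    """Echo markup as written and pass only text nodes through `translate`."""

    def __init__(self, translate):
        # convert_charrefs=False keeps entities such as &amp; untouched.
        super().__init__(convert_charrefs=False)
        self.translate = translate
        self.out = []
        self.raw = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original start tag text
        self.raw = tag in ("script", "style")

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)
        self.raw = False

    def handle_data(self, data):
        # Leave whitespace-only runs and script/style bodies alone.
        if self.raw or not data.strip():
            self.out.append(data)
        else:
            self.out.append(self.translate(data))

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def translate_html(source, translate):
    """Return `source` with text nodes translated and no tags added or moved."""
    parser = TextTranslator(translate)
    parser.feed(source)
    parser.close()
    return "".join(parser.out)
```

With `str.upper` standing in for a real translator, `translate_html("<table><tr><td>some text</td></table>", str.upper)` keeps the missing `</tr>` missing, because nothing in this pass ever inserts a tag.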