Question

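DRM is a plague, even in the book industry. Last week I discovered that many of my Kindle annotations were missing, because the publisher tries to limit annotations to 10% of the book.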

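I found tools that convert the Mobi book file to HTML. I also used the location data (which, fortunately, is not missing) to extract the corresponding chunks of the raw HTML. My problem now is that I have a lot of incomplete markup to deal with.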

Example:

></h1><div height="3em"></div> <p height="0em" width="1em" align="justify"><em>A Pocket Mirror for Heroes</em> is a book of stratagems for reaching excellence in a competitive world ruled by appearances and, often, deceit.</p><div height="0em"></div> <p height="0em" width="1em" align="justify">It is a <em>mirror</em> because it reflects &#x201C;the person you are or the one you ought to be.&#x201D; A <em>pocket</em> mirror because its author took the time to be brief. A mirror for <em>heroes</em> because it provides a vivid image of ethical and moral perfection. For the author, a hero is &#x201C;the consummate person, ripe and perfect: accurate in judgment, mature in taste, attentive in listening, wise in sayings, shrewd in deeds, the cente

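This happens because Kindle location data only maps to 150-byte chunks of the HTML, so the chunk boundaries are quite imprecise and can fall in the middle of a tag or a word.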

I would like to clean this up. Does anyone have any suggestions? I would prefer to use Python if possible.

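Edit: It might make sense to use a tool that can be given character offsets and will figure out how to extract something clean from them. Does such a thing exist?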

Answer 1

BeautifulSoup can parse malformed HTML and is very robust: it closes any tags that were left open instead of rejecting the input.

>>> from bs4 import BeautifulSoup
>>> html = "<p>Para 1<p>Para 2<blockquote>Quote 1<blockquote>Quote 2"
>>> soup = BeautifulSoup(html, "html.parser")
>>> print(soup.prettify())
<p>
 Para 1
 <p>
  Para 2
  <blockquote>
   Quote 1
   <blockquote>
    Quote 2
   </blockquote>
  </blockquote>
 </p>
</p>
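
For chunks like the one in the question, which begin and end in the middle of a tag, a standard-library-only alternative might look like the sketch below. It is not BeautifulSoup and not an existing tool: `TextCollector`, `trim_partial_tags`, and `fragment_to_text` are hypothetical names, the sample is a shortened copy of the fragment from the question, and the trimming is a heuristic that assumes no literal `<` or `>` appears in the text itself.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes; stray or unclosed tags are simply ignored."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &#x201C;
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def trim_partial_tags(fragment):
    """Drop a tag that was cut in half at either edge of the chunk."""
    first_lt, first_gt = fragment.find("<"), fragment.find(">")
    # A '>' before any '<' means the chunk started inside a tag.
    if first_gt != -1 and (first_lt == -1 or first_gt < first_lt):
        fragment = fragment[first_gt + 1:]
    last_lt, last_gt = fragment.rfind("<"), fragment.rfind(">")
    # A '<' after the last '>' means the chunk ended inside a tag.
    if last_lt > last_gt:
        fragment = fragment[:last_lt]
    return fragment


def fragment_to_text(fragment):
    """Return the readable text of a possibly truncated HTML chunk."""
    parser = TextCollector()
    parser.feed(trim_partial_tags(fragment))
    parser.close()  # flush any text still buffered by the parser
    # Collapse the whitespace left behind where tags used to be.
    return " ".join("".join(parser.parts).split())


# Shortened copy of the fragment from the question (assumption: the cut
# points were chosen here for illustration).
sample = ('></h1><div height="3em"></div> <p height="0em" width="1em" '
          'align="justify">It is a <em>mirror</em> because it reflects '
          '&#x201C;the person you are or the one you ought to be.&#x201D; '
          'A <em>pocket</em> mirror</p><p heig')
print(fragment_to_text(sample))
```

This only recovers the text. A word cut off at the chunk boundary stays cut off, so if the full converted HTML is available it is probably better to widen each slice to the nearest tag boundaries before parsing.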

Extract text from broken HTML?