Extracting text from broken HTML?

Date: 2015-06-25 06:13:01

Tags: python html screen-scraping kindle

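DRM is a plague even in the book industry. Last week I discovered that many of my Kindle annotations were missing, because the publisher tried to limit annotations to 10% of the book.

I found tools that convert Mobi book files to HTML. I also used the location data (which, fortunately, was not missing) to extract the corresponding chunks of the raw HTML. My problem now is that I have a lot of incomplete markup to deal with.

Example: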

></h1><div height="3em"></div> <p height="0em" width="1em" align="justify"><em>A Pocket Mirror for Heroes</em> is a book of stratagems for reaching excellence in a competitive world ruled by appearances and, often, deceit.</p><div height="0em"></div> <p height="0em" width="1em" align="justify">It is a <em>mirror</em> because it reflects &#x201C;the person you are or the one you ought to be.&#x201D; A <em>pocket</em> mirror because its author took the time to be brief. A mirror for <em>heroes</em> because it provides a vivid image of ethical and moral perfection. For the author, a hero is &#x201C;the consummate person, ripe and perfect: accurate in judgment, mature in taste, attentive in listening, wise in sayings, shrewd in deeds, the cente

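This happens because Kindle location data only maps to 150-byte blocks of the HTML, so the chunk boundaries are imprecise and can fall in the middle of a tag or a word.

I'd like to clean this up. Does anyone have any suggestions? I'd prefer to use Python if possible.

Edit: It might make sense to use a tool that can be given a character offset and will figure out how to extract something clean from there. Does such a thing exist?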

1 Answer:

Answer 0 (score: 2)

BeautifulSoup can parse malformed HTML, and it is very robust.

>>> from bs4 import BeautifulSoup
>>> html = "<p>Para 1<p>Para 2<blockquote>Quote 1<blockquote>Quote 2"
>>> soup = BeautifulSoup(html, "html.parser")
>>> print(soup.prettify())
<p>
 Para 1
 <p>
  Para 2
  <blockquote>
   Quote 1
   <blockquote>
    Quote 2
   </blockquote>
  </blockquote>
 </p>
</p>
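
For the Kindle fragments in the question, the remaining step is to turn a chunk into plain text. Below is a minimal sketch that uses only the standard library's html.parser rather than BeautifulSoup. The names TextCollector and fragment_to_text are made up for illustration. It assumes the only damage is a tag cut in half at either edge of the chunk, and that the text itself contains no literal '<' or '>'.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes and ignores every tag, balanced or not."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &#x201C;
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def fragment_to_text(fragment):
    """Hypothetical helper: plain text from a chunk cut at arbitrary offsets."""
    # Drop the tail of a tag that was cut in half at the start,
    # e.g. '></h1>...' or 'gn="justify">...'.
    fragment = re.sub(r"^[^<>]*>", "", fragment)
    # Drop the head of a tag that was cut in half at the end,
    # e.g. '...<p heig'.
    fragment = re.sub(r"<[^<>]*$", "", fragment)

    parser = TextCollector()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    # Collapse whitespace left behind where tags were removed.
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    chunk = ('></h1><div height="3em"></div> <p align="justify">'
             '<em>A Pocket Mirror</em> is &#x201C;brief.&#x201D;</p><p heig')
    print(fragment_to_text(chunk))
```

The same idea should carry over to BeautifulSoup: after trimming the two edges, soup.get_text() plays the role of the collector class. A word cut off at the end of a chunk (such as "the cente" in the question) is kept as-is, since nothing in the chunk says how it continues.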