Question

This has also been reported at https://github.com/cantino/ruby-readability/issues/66

I am using ruby-readability, https://github.com/cantino/ruby-readability

The problem is that it returns extra divs around the content.

For example:

require 'readability'

content = "Remind's classroom communication app is used in more than half of all US public schools. It's be
  cause its co-founder Brett Kopf and team are unabashedly obsessed with their users. Here's how they
build remarkable relationships with customers. <br /><br />\n<a href=\"http://firstround.com/review/
your-users-deserve-better-an-inside-look-at-reminds-customer-obsession/?utm_medium=rss&amp;utm_sourc
e=frr_feed&amp;utm_campaign=home_stream&amp;utm_content=read_more\">Continue reading at First Round
Review &raquo;</a>"
  content =  Readability::Document.new(content, :tags => %w[div p a], :attributes => %w[src href], :remove_empty_nodes => true).content

This returns:

=> "<div><div><p>Remind's classroom communication app is used in more than half of all US public sch
ools. It's be\n  cause its co-founder Brett Kopf and team are unabashedly obsessed with their users.
 Here's how they\nbuild remarkable relationships with customers. </p><p><a href=\"http://firstround.
com/review/&#10;your-users-deserve-better-an-inside-look-at-reminds-customer-obsession/?utm_medium=r
ss&amp;utm_sourc&#10;e=frr_feed&amp;utm_campaign=home_stream&amp;utm_content=read_more\">Continue re
ading at First Round\nReview »</a></p></div></div>"

I would like to know what the problem is and what I can do to fix it.

Answer 1

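The problem seems to stem from the fact that your input HTML contains only a single paragraph. If I understand it correctly, the ruby-readability gem searches the input HTML for articles (generally denoted by a <div> tag) that contain one or more paragraphs (<p> elements). It finds all of these paragraphs, scores their relevance, and tries to determine the main article on the page.

The important point is that it takes the "article" to be the parent node of the top-scoring paragraphs.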

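Now, the two <div> tags are added in the get_article method. First, the method always wraps the found article in a <div>. It then copies over all child tags of the found article, changing the tag to <div> if the article itself is anything other than a <div> or a <p>. Since your article node (i.e. the parent of the single paragraph in your input HTML) is a <body> tag, presumably added by the HTML parser around the bare fragment, it is changed to a <div>, which effectively results in two <div>s in the output.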
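
To make the wrapping behaviour concrete, here is a toy model of the two steps described in this answer. It is only a sketch: `Node` and `wrap_article` are hypothetical names invented for illustration, they do not use Nokogiri, and they are not the gem's actual code.

```ruby
# Toy stand-in for a parsed HTML element (hypothetical, not Nokogiri).
Node = Struct.new(:name, :children) do
  def to_html
    inner = children.map { |c| c.is_a?(Node) ? c.to_html : c.to_s }.join
    "<#{name}>#{inner}</#{name}>"
  end
end

# Simplified model of the behaviour described for get_article:
#   1. copy the article node, renaming it to "div" unless it is a div or p
#   2. always wrap the result in a fresh <div>
def wrap_article(article)
  copy = Node.new(article.name, article.children)
  copy.name = "div" unless %w[div p].include?(copy.name)
  Node.new("div", [copy])
end

# A bare fragment ends up under <body>, so <body> is the "article" here.
body = Node.new("body", [
  Node.new("p", ["text"]),
  Node.new("p", [Node.new("a", ["link"])])
])

puts wrap_article(body).to_html
# => <div><div><p>text</p><p><a>link</a></p></div></div>
```

The renamed <body> plus the unconditional wrapper account for the two nested <div>s seen in the question.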

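The most correct way to fix this would probably be special handling, in the get_article method, of the case where the article is actually the <body> of the page. Alternatively, you could simply ignore the double <div> in your case.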
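
If ignoring the extra wrapper is not an option, one possible workaround is to post-process the returned string. The sketch below is an assumption-laden illustration, not part of ruby-readability: `unwrap_redundant_divs` and `single_div?` are hypothetical helpers, and they assume the wrapper <div>s carry no attributes (which holds when only `src` and `href` are whitelisted, as in the question). A real HTML parser such as Nokogiri would be more robust.

```ruby
# Returns true if the fragment consists of exactly one <div>...</div> element.
def single_div?(fragment)
  return false unless fragment.start_with?("<div>") && fragment.end_with?("</div>")

  depth = 0
  fragment.scan(/<\/?div\b[^>]*>/) do |tag|
    depth += tag.start_with?("</") ? -1 : 1
    # Depth hitting zero before the end means there are sibling nodes.
    return false if depth.zero? && Regexp.last_match.end(0) < fragment.length
  end
  depth.zero?
end

# Strips outer <div> wrappers whose only child is another <div>.
def unwrap_redundant_divs(html)
  result = html.strip
  loop do
    inner = result[/\A<div>(.*)<\/div>\z/m, 1]
    break unless inner && single_div?(inner)
    result = inner
  end
  result
end

puts unwrap_redundant_divs("<div><div><p>text</p></div></div>")
# => <div><p>text</p></div>
```

An outer <div> that holds several sibling elements is left untouched, so only the redundant nesting is removed.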

ruby-readability: extra div added
