ruby-readability: extra divs added

Time: 2016-05-01 03:54:02

Tags: ruby-on-rails ruby

This has also been reported at https://github.com/cantino/ruby-readability/issues/66

I am using ruby-readability (https://github.com/cantino/ruby-readability).

The problem is that it returns extra divs around the content.

For example:

content = "Remind's classroom communication app is used in more than half of all US public schools. It's be
  cause its co-founder Brett Kopf and team are unabashedly obsessed with their users. Here's how they
build remarkable relationships with customers. <br /><br />\n<a href=\"http://firstround.com/review/
your-users-deserve-better-an-inside-look-at-reminds-customer-obsession/?utm_medium=rss&amp;utm_sourc
e=frr_feed&amp;utm_campaign=home_stream&amp;utm_content=read_more\">Continue reading at First Round
Review &raquo;</a>"
content = Readability::Document.new(content, :tags => %w[div p a], :attributes => %w[src href], :remove_empty_nodes => true).content

returns

=> "<div><div><p>Remind's classroom communication app is used in more than half of all US public sch
ools. It's be\n  cause its co-founder Brett Kopf and team are unabashedly obsessed with their users.
 Here's how they\nbuild remarkable relationships with customers. </p><p><a href=\"http://firstround.
com/review/&#10;your-users-deserve-better-an-inside-look-at-reminds-customer-obsession/?utm_medium=r
ss&amp;utm_sourc&#10;e=frr_feed&amp;utm_campaign=home_stream&amp;utm_content=read_more\">Continue re
ading at First Round\nReview »</a></p></div></div>"

I would like to know what the problem is and what I can do to fix it.

1 Answer:

Answer 0: (score: 1)

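The problem seems to stem from the fact that your input HTML contains only a single paragraph. If I understand correctly, the ruby-readability gem searches the input HTML for articles (typically marked by <div> tags) that contain one or more paragraphs (<p> elements). It looks at all of those paragraphs, scores their relevance, and tries to determine the main article on the page.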

The important fact is that it takes the "article" to be the parent node of the highest-scoring paragraph.

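Now, two <div> tags are added in the get_article method. First, the method always wraps the article it found in a <div>. Then it copies all child tags of the found article, changing a tag to <div> if the article itself is anything other than a <div> or <p>. Your input has no enclosing element, so once it is parsed the single paragraph sits directly under <body>. Since your article node (i.e. the parent of that paragraph) is therefore the <body> tag, it is changed to a <div> tag, which effectively produces the two <div>s in the output.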
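The two wrapping steps can be illustrated with a small toy model. This is only a sketch of the behavior described in this answer, not the gem's actual code: `Node` and `toy_get_article` are hypothetical names, and the real method works on Nokogiri nodes and does more than this.

```ruby
# Toy model of the two wrapping steps; NOT the gem's implementation.
# Node and toy_get_article are hypothetical names used only for illustration.
Node = Struct.new(:name, :children) do
  # Serialize the node and its children as a plain HTML string.
  def to_html
    inner = children.map { |c| c.is_a?(Node) ? c.to_html : c.to_s }.join
    "<#{name}>#{inner}</#{name}>"
  end
end

def toy_get_article(article)
  # Step 1: the result is always wrapped in a fresh <div>.
  output = Node.new("div", [])
  # Step 2: the article node is copied, and its tag is renamed to <div>
  # unless it is already a <div> or a <p>.
  copy = Node.new(article.name, article.children.dup)
  copy.name = "div" unless %w[div p].include?(copy.name)
  output.children << copy
  output
end

paragraph = Node.new("p", ["Some text"])
body = Node.new("body", [paragraph]) # a lone paragraph's parent is <body>
puts toy_get_article(body).to_html
# prints <div><div><p>Some text</p></div></div>
```

The outer <div> comes from step 1 and the inner one is the renamed <body> from step 2, which matches the shape of the output in the question.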

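The most correct fix would probably be to special-case, inside the get_article method, the situation where the article is actually the body of the page. Alternatively, you could simply ignore the double <div>s in your case.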
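If ignoring the extra wrapper is not an option, a third possibility (not part of the gem) is to collapse it after the fact. The following is a minimal sketch under stated assumptions: `unwrap_outer_divs` and `sole_div_contents` are hypothetical helpers, they work on the returned string with plain tag counting rather than a real HTML parser, and they only expect attribute-free `<div>` wrappers like the ones in the output shown in the question.

```ruby
# Hypothetical post-processing helpers; not part of ruby-readability.
DIV_TAG = /<div\b[^>]*>|<\/div>/

# Returns the inner HTML if +html+ is exactly one attribute-free
# <div>...</div> element spanning the whole string, otherwise nil.
def sole_div_contents(html)
  return nil unless html.start_with?("<div>") && html.end_with?("</div>")

  depth = 0
  pos = 0
  while (match = DIV_TAG.match(html, pos))
    depth += match[0].start_with?("</") ? -1 : 1
    pos = match.end(0)
    # The first <div> closed before the end, so it does not wrap everything.
    return nil if depth.zero? && pos < html.length
  end
  depth.zero? ? html[5...-6] : nil
end

# Removes outer <div> wrappers for as long as the outermost <div> contains
# nothing but another single <div>, so one wrapper always remains.
def unwrap_outer_divs(html)
  result = html.strip
  loop do
    inner = sole_div_contents(result)
    break unless inner && sole_div_contents(inner.strip)
    result = inner.strip
  end
  result
end

puts unwrap_outer_divs("<div><div><p>a</p><p>b</p></div></div>")
# prints <div><p>a</p><p>b</p></div>
```

It would be applied to the string returned by `.content`. Since the gem already depends on Nokogiri, doing the same unwrapping on a parsed fragment would be more robust than string scanning for anything beyond this simple case.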