I am using ruby-readability (https://github.com/cantino/ruby-readability). This has also been reported at https://github.com/cantino/ruby-readability/issues/66.
The problem is that it returns extra divs around the content.
For example:
content = "Remind's classroom communication app is used in more than half of all US public schools. It's be
cause its co-founder Brett Kopf and team are unabashedly obsessed with their users. Here's how they
build remarkable relationships with customers. <br /><br />\n<a href=\"http://firstround.com/review/
your-users-deserve-better-an-inside-look-at-reminds-customer-obsession/?utm_medium=rss&utm_sourc
e=frr_feed&utm_campaign=home_stream&utm_content=read_more\">Continue reading at First Round
Review »</a>"
content = Readability::Document.new(content, :tags => %w[div p a], :attributes => %w[src href], :remove_empty_nodes => true).content
returns
=> "<div><div><p>Remind's classroom communication app is used in more than half of all US public sch
ools. It's be\n cause its co-founder Brett Kopf and team are unabashedly obsessed with their users.
Here's how they\nbuild remarkable relationships with customers. </p><p><a href=\"http://firstround.
com/review/ your-users-deserve-better-an-inside-look-at-reminds-customer-obsession/?utm_medium=r
ss&utm_sourc e=frr_feed&utm_campaign=home_stream&utm_content=read_more\">Continue re
ading at First Round\nReview »</a></p></div></div>"
What is causing this, and what can I do to fix it?
Answer 0: (score: 1)
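The problem seems to stem from the fact that your input html contains only a single paragraph. If I understand it correctly, the ruby-readability gem searches the input html for articles (usually denoted by <div> tags) that contain one or more paragraphs (<p> elements). It finds all of those paragraphs, calculates their relevance, and tries to determine the main article on the page. The important fact is that it takes the "article" to be the parent node of the top-scoring paragraphs (see the gem's source).
Now, two <div> tags are added in the get_article method. First, the method always wraps the found article in a <div>. Then it copies all child tags of the found article, changing the tag to <div> if the article itself is anything other than a <div> or a <p>. Since your article node (i.e. the parent of the single paragraph in the input html) is the <body> tag, it is changed to a <div> tag, which effectively results in two <div>s in the output.
The most correct way to solve this would probably be to special-case, in the get_article method, the situation where the article actually is the body of the page. Alternatively, you could probably just ignore the double <div> in your case.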
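If the doubled wrapper cannot simply be ignored, one possible workaround is to collapse it after the fact. The following is only a minimal sketch, not part of ruby-readability: collapse_double_div and sole_div? are hypothetical helper names, and the code assumes the output uses lowercase, attribute-less wrapper tags exactly like the "<div><div>...</div></div>" string shown in the question.

```ruby
# Hypothetical helpers (not part of ruby-readability).

# True when +fragment+ is exactly one attribute-less <div> element that
# spans the whole string, i.e. its opening tag is closed by the final </div>.
def sole_div?(fragment)
  return false unless fragment.start_with?("<div>") && fragment.end_with?("</div>")

  depth = 0
  fragment.scan(%r{<div\b[^>]*>|</div>}) do |tag|
    depth += tag.start_with?("</") ? -1 : 1
    # The first <div> is closed here; it is the sole element only if
    # this closing tag is also the end of the string.
    return Regexp.last_match.end(0) == fragment.length if depth.zero?
  end
  false # unbalanced tags: leave the input alone
end

# Collapses "<div><div>...</div></div>" into "<div>...</div>" for as long as
# the outer <div> contains nothing but a single inner <div>.
def collapse_double_div(html)
  result = html.strip
  while sole_div?(result)
    inner = result[5...-6] # drop the leading "<div>" and trailing "</div>"
    break unless sole_div?(inner)
    result = inner
  end
  result
end
```

Under those assumptions it could be applied to the result of the call from the question, e.g. collapse_double_div(Readability::Document.new(...).content). Output whose outer <div> holds several sibling elements is returned unchanged. For anything less regular than this, parsing the output with an HTML parser would be more robust than string scanning.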