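I have a website where users can submit content based on a link. Is there a way to detect the main content behind a link and grab a teaser from it? On Digg, for example, every entry has a small clip/excerpt taken from the linked page. That is exactly what I want.
I am using Ruby on Rails. I found this question on extracting article excerpts, but any hint in the right direction would help.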
Answer 0 (score: 1)
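I found out that Digg uses Facebook's Open Graph Protocol (http://ogp.me/).
In the end, that is exactly what I was looking for!
Ruby gem OpenGraph: https://github.com/intridea/opengraph
Reading the "description" metadata tag gives me the description, for example: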
require 'opengraph'
article = OpenGraph.fetch('http://www.page.com/article/1124')
article.description # => 'This is a small description of the movie'
Some pages have no description, although most articles do.
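For pages without a description, one option is to fall back to text taken from the page itself and trim it to teaser length. Below is a minimal sketch of that idea in plain Ruby. The helper name `teaser_for`, its `limit:` parameter, and the 160-character default are my own assumptions, not part of the OpenGraph gem.

```ruby
# Hypothetical helper (not part of any gem): use the Open Graph description
# when there is one, otherwise fall back to other text from the page.
# Long text is cut at the last space inside the limit and marked with "...".
def teaser_for(description, fallback_text, limit: 160)
  text = description.to_s.strip
  text = fallback_text.to_s.strip if text.empty?
  text = text.gsub(/\s+/, ' ') # collapse runs of whitespace and newlines
  return text if text.length <= limit

  cut = text[0, limit]
  last_space = cut.rindex(' ')
  cut = cut[0, last_space] if last_space # back up to a word boundary
  "#{cut}..."
end

puts teaser_for(nil, 'Hurricane Ike weakened slightly on Sunday.', limit: 20)
```

Where the fallback text comes from is a separate problem; the DOM-based approach in the next answer is one candidate.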
Answer 1 (score: 0)
How to extract the main article content of a web page
Try extracting the text by using the DOM. Here is a sample page:
<body>
<div>
<ul>
<li><a href="/home">Home</a></li>
<li><a href="/politics">Politics</a></li>
<li><a href="/health">Health</a></li>
<li><a href="/travel">Travel</a></li>
<li><a href="/about">About</a></li>
</ul>
</div>
<div>
<div>
<p><b>MIAMI, Florida (CNN) </b> -- Hurricane Ike weakened slightly...
<p>Ike hit Turks and Caicos Islands Sunday morning, leaving a trail of...
<p>"It pretty much looks like an episode of 'The Twilight Zone,' " said...
<p>Aftwood estimates at least 90 percent of homes he saw on the island were...
<p>The possibility of similar devastation prompted state and local officials...
<p>"Let's hope it's all a false alarm," Louisiana Gov. Bobby Jindal said...
</div>
<div>
<p>Some side-story that we don't really care about.</p>
<p>Another paragraph for this story.</p>
</div>
<div>
<p>Yet another semi-related side-story that we still don't care about.</p>
<p>Another paragraph for this story.</p>
<p>Another paragraph for this story.</p>
<p>Yet another paragraph for this story.</p>
</div>
</div>
<div>© 2008 Cable News Network.</div>
</body>
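Obviously, we don't care about the navigation link text or the two side stories. Let's break the page down by DOM position. We have six <p> tags inside the first <div> of the second <div> of the body. We write this position as a list of indices, such as (2,1,*). If we group all text nodes this way and keep track of how much text each group contains, we get a table like this: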
location = characters
(1,1,1,1) = 4
(1,1,2,1) = 8
(1,1,3,1) = 6
(1,1,4,1) = 6
(1,1,5,1) = 5
(2,1,*) = 500
(2,2,*) = 100
(2,3,*) = 250
(3) = 26
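The grouping can be sketched in Ruby roughly as follows. This is only an illustration under a few assumptions: `Node` and `text_by_location` are hypothetical names, the tree is built by hand instead of coming from a real HTML parser, and replacing the last index of each path with `*` is my guess at how the groups in the table were formed.

```ruby
# Hypothetical minimal DOM node: a tag name plus children, where each child
# is either another Node or a String (a text node).
Node = Struct.new(:tag, :children)

# Walk the tree and add the length of every text node to a group keyed by
# the position of its enclosing element, with the last index wildcarded so
# that sibling elements (e.g. consecutive <p> tags) share one group.
def text_by_location(node, path = [], totals = Hash.new(0))
  element_index = 0
  node.children.each do |child|
    if child.is_a?(Node)
      element_index += 1 # 1-based index among element children only
      text_by_location(child, path + [element_index], totals)
    else
      length = child.strip.length
      next if length.zero? # skip whitespace-only text nodes
      totals[path[0...-1] + ['*']] += length
    end
  end
  totals
end

# A cut-down version of the sample page, built by hand.
page = Node.new('body', [
  Node.new('div', [
    Node.new('ul', [
      Node.new('li', [Node.new('a', ['Home'])]),
      Node.new('li', [Node.new('a', ['Politics'])])
    ])
  ]),
  Node.new('div', [
    Node.new('div', [
      Node.new('p', ['Hurricane Ike weakened slightly...']),
      Node.new('p', ['Ike hit Turks and Caicos Islands Sunday morning...'])
    ]),
    Node.new('div', [Node.new('p', ['Some side-story.'])])
  ])
])

totals = text_by_location(page)
totals.each { |location, chars| puts "(#{location.join(',')}) = #{chars}" }

# The group holding the most text is the best guess for the main article.
main_location, = totals.max_by { |_, chars| chars }
puts "main content at (#{main_location.join(',')})"
```

In a real application the tree would come from an HTML parser, and the group with the most characters would be the candidate for the excerpt.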