Question

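I have a website where users can submit content by link. Is there a way to detect the main content behind a link and pull a teaser from it? On Digg, for example, every entry has a small clip/excerpt taken from the linked page. That is exactly what I want.

I'm using Ruby on Rails. I found this question on extracting article excerpts, but any pointer in the right direction would help.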

Answer 1

I found out that Digg uses Facebook's Open Graph Protocol (http://ogp.me/).

In the end, this is exactly what I was looking for!

Ruby Gem OpenGraph: https://github.com/intridea/opengraph

By reading the "description" metadata tag, I get the description, e.g.

require 'opengraph'
article = OpenGraph.fetch('http://www.page.com/article/1124')
article.description # => 'This is a small description of the movie'

Some pages (though not most articles) don't have a description.
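For pages without Open Graph data, one option is to fall back to the ordinary `<meta name="description">` tag. The following is a minimal sketch using only the Ruby standard library, not the gem's own implementation. The method name `teaser_from_meta` is hypothetical, and the regex-based tag scan is a simplification that assumes quoted attribute values with no `>` inside them; a real HTML parser would be more robust.

```ruby
# Return the og:description value, falling back to the plain
# <meta name="description"> tag, or nil when the page has neither.
# NOTE: hypothetical helper; regex scanning is a simplification.
def teaser_from_meta(html)
  found = {}
  html.scan(/<meta\b[^>]*>/i) do |tag|
    # Collect the tag's quoted attributes, e.g. {"property" => "og:description"}
    pairs = tag.scan(/([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/)
    attrs = pairs.map { |name, dq, sq| [name.downcase, dq || sq] }.to_h
    key = attrs['property'] || attrs['name']
    # Keep the first value seen for each property/name
    found[key.downcase] ||= attrs['content'] if key && attrs['content']
  end
  found['og:description'] || found['description']
end

html = <<~HTML
  <head>
    <meta name="description" content="Plain fallback description">
    <meta property="og:description" content="This is a small description of the movie">
  </head>
HTML

puts teaser_from_meta(html)
```

When neither tag is present the method returns nil, which is the case where a content-based approach would have to take over.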

Answer 2

How to extract a webpage's main article content

Try extracting the text using the DOM. Here is a sample page:

<body>
    <div>
        <ul>
            <li><a href="/home">Home</a></li>
            <li><a href="/politics">Politics</a></li>
            <li><a href="/health">Health</a></li>
            <li><a href="/travel">Travel</a></li>
            <li><a href="/about">About</a></li>
        </ul>
    </div>
    <div>
        <div>
            <p><b>MIAMI, Florida (CNN) </b> -- Hurricane Ike weakened slightly...
            <p>Ike hit Turks and Caicos Islands Sunday morning, leaving a trail of...
            <p>"It pretty much looks like an episode of 'The Twilight Zone,' " said...
            <p>Aftwood estimates at least 90 percent of homes he saw on the island were...
            <p>The possibility of similar devastation prompted state and local officials...
            <p>"Let's hope it's all a false alarm," Louisiana Gov. Bobby Jindal said...
        </div>
        <div>
            <p>Some side-story that we don't really care about.</p>
            <p>Another paragraph for this story.</p>
        </div>
        <div>
            <p>Yet another semi-related side-story that we still don't care about.</p>
            <p>Another paragraph for this story.</p>
            <p>Another paragraph for this story.</p>
            <p>Yet another paragraph for this story.</p>
        </div>
    </div>
    <div>© 2008 Cable News Network.</div>
</body>

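Obviously, we don't care about the navigation link text or the two side stories. Let's break the page down by DOM position. We have six <p> tags inside the first <div> of the second <div> of the body. We denote this location as a list of indices, such as (2,1,*). If we group all text nodes this way and track how much text each group contains, we get a table like this: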

location = characters
(1,1,1,1) = 4
(1,1,2,1) = 8
(1,1,3,1) = 6
(1,1,4,1) = 6
(1,1,5,1) = 5
(2,1,*) = 500
(2,2,*) = 100
(2,3,*) = 250
(3) = 26
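The grouping above can be sketched in Ruby roughly as follows. This is an illustration under stated assumptions, not a finished extractor: the `Node` struct is a hypothetical stand-in for whatever tree an HTML parser would give you, the rule that sibling <p> tags are pooled under their parent as (..., *) is my reading of the table, and treating the largest group as the main article is one plausible next step rather than something the table itself proves.

```ruby
# Hypothetical stand-in for a parsed DOM element: a tag name plus children,
# where each child is either another Node or a String of text.
Node = Struct.new(:tag, :children)

# Add each text node's length to its location group.
# path     - 1-based child-element indices from <body> down to `node`
# p_parent - path of the parent of the nearest enclosing <p>, or nil
def tally_text(node, path = [], p_parent = nil, counts = Hash.new(0))
  index = 0
  node.children.each do |child|
    if child.is_a?(String)
      key = p_parent ? p_parent + ['*'] : path
      counts[key] += child.strip.length
    else
      index += 1
      # Assumption: text under sibling <p> tags is pooled as (..., *).
      inside_p = p_parent || (child.tag == 'p' ? path : nil)
      tally_text(child, path + [index], inside_p, counts)
    end
  end
  counts
end

# A cut-down version of the sample page.
body = Node.new('body', [
  Node.new('div', [
    Node.new('ul', [
      Node.new('li', [Node.new('a', ['Home'])]),
      Node.new('li', [Node.new('a', ['Politics'])])
    ])
  ]),
  Node.new('div', [
    Node.new('div', [
      Node.new('p', [Node.new('b', ['MIAMI, Florida (CNN) ']),
                     '-- Hurricane Ike weakened slightly...']),
      Node.new('p', ['Ike hit Turks and Caicos Islands Sunday morning, leaving a trail of...'])
    ]),
    Node.new('div', [
      Node.new('p', ["Some side-story that we don't really care about."])
    ])
  ]),
  Node.new('div', ['© 2008 Cable News Network.'])
])

counts = tally_text(body)
main_location, _size = counts.max_by { |_location, size| size }

counts.each { |location, size| puts "(#{location.join(',')}) = #{size}" }
puts "largest group: (#{main_location.join(',')})"
```

On this cut-down tree the navigation links and the footer land in small groups of their own, while the article paragraphs pool into a single (2,1,*) group that is larger than the side story's.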

How do I get a teaser excerpt from a web article or user-submitted link?
