How do I get a teaser excerpt from a web article / user-posted link?

Time: 2011-11-29 06:28:18

Tags: ruby-on-rails web-scraping

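I have a site where users can submit content as links. Is there a way to detect the main content behind a link and pull a teaser from it? For example, on Digg every entry has a small clip/excerpt taken from the linked page. That is exactly what I want.

I'm using Ruby on Rails. I found this question on extracting article excerpts, but any pointers in the right direction would help.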

2 answers:

Answer 0: (score: 1)

I found out that Digg uses Facebook's Open Graph Protocol (http://ogp.me/).

In the end, this is exactly what I was looking for!

Ruby Gem OpenGraph: https://github.com/intridea/opengraph

By reading the "description" metadata tag, I get the description, e.g.

require 'opengraph'
article = OpenGraph.fetch('http://www.page.com/article/1124')
article.description # => 'This is a small description of the movie'

Some pages (though not most articles) have no description.
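For pages that lack an Open Graph description, one possible fallback is the plain "description" meta tag. The snippet below is only a minimal sketch of that idea using the standard library: teaser_from_meta is a hypothetical helper (not part of the opengraph gem), it works on an HTML string you have already downloaded, and its regex only understands double-quoted attributes, so a real HTML parser would be more robust.

```ruby
require 'cgi'

# Hypothetical helper: return the teaser advertised in a page's <meta> tags,
# preferring og:description and falling back to the plain "description" tag.
# Returns nil when neither tag is present.
def teaser_from_meta(html)
  found = {}
  html.scan(/<meta\b[^>]*>/i) do |tag|
    # Collect double-quoted attributes of this <meta> tag into a hash.
    attrs = tag.scan(/([\w:-]+)\s*=\s*"([^"]*)"/).to_h { |k, v| [k.downcase, v] }
    key = attrs['property'] || attrs['name']
    next unless key && attrs['content']
    # Keep the first occurrence of each tag and decode HTML entities.
    found[key.downcase] ||= CGI.unescapeHTML(attrs['content'])
  end
  found['og:description'] || found['description']
end

# Made-up sample page for illustration.
page = <<~HTML
  <html><head>
    <meta property="og:title" content="Example Movie" />
    <meta property="og:description" content="This is a small description of the movie" />
    <meta name="description" content="Generic site description" />
  </head><body></body></html>
HTML

puts teaser_from_meta(page)
```

If neither tag exists the helper returns nil, and you would have to fall back to extracting text from the page body itself.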

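Answer 1: (score: 0)

How to extract the main article content of a web page

Try extracting the text by its position in the DOM. Here is a sample page: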

<body>
    <div>
        <ul>
            <li><a href="/home">Home</a></li>
            <li><a href="/politics">Politics</a></li>
            <li><a href="/health">Health</a></li>
            <li><a href="/travel">Travel</a></li>
            <li><a href="/about">About</a></li>
        </ul>
    </div>
    <div>
        <div>
            <p><b>MIAMI, Florida (CNN) </b> -- Hurricane Ike weakened slightly...
            <p>Ike hit Turks and Caicos Islands Sunday morning, leaving a trail of...
            <p>"It pretty much looks like an episode of 'The Twilight Zone,' " said...
            <p>Aftwood estimates at least 90 percent of homes he saw on the island were...
            <p>The possibility of similar devastation prompted state and local officials...
            <p>"Let's hope it's all a false alarm," Louisiana Gov. Bobby Jindal said...
        </div>
        <div>
            <p>Some side-story that we don't really care about.</p>
            <p>Another paragraph for this story.</p>
        </div>
        <div>
            <p>Yet another semi-related side-story that we still don't care about.</p>
            <p>Another paragraph for this story.</p>
            <p>Another paragraph for this story.</p>
            <p>Yet another paragraph for this story.</p>
        </div>
    </div>
    <div>© 2008 Cable News Network.</div>
</body>

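Obviously, we don't care about the navigation link text or the two side stories. Let's break the page down by DOM position. There are six <p> tags in the first <div> of the second <div> of the body. We write this location as a list of indices, such as (2,1,*). If we group all text nodes this way and keep track of how much text each group contains, we get a table like this: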

location = characters
(1,1,1,1) = 4
(1,1,2,1) = 8
(1,1,3,1) = 6
(1,1,4,1) = 6
(1,1,5,1) = 5
(2,1,*) = 500
(2,2,*) = 100
(2,3,*) = 250
(3) = 26
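
The group with the most text, (2,1,*), is the main article. Below is a rough Ruby sketch of that tally. It rests on a few assumptions of mine: Node, el, text_length and tally are hypothetical names, the tree is built by hand as a stand-in for what an HTML parser such as Nokogiri would give you, and the rule "sibling <p> tags share one group, everything else is keyed by its own path" is just my reading of the table above.

```ruby
# Minimal stand-in for a parsed DOM element: a tag name plus children,
# where each child is either another Node or a String (a text node).
Node = Struct.new(:tag, :children)

def el(tag, *children)
  Node.new(tag, children)
end

# Number of text characters anywhere below a node.
def text_length(node)
  node.children.sum { |c| c.is_a?(String) ? c.strip.length : text_length(c) }
end

# Walk the tree and add up text length per location key.
# path holds the 1-based element indices from <body> down to node.
def tally(node, path = [], counts = Hash.new(0))
  index = 0
  node.children.each do |child|
    if child.is_a?(String)
      size = child.strip.length
      counts["(#{path.join(',')})"] += size if size > 0
    else
      index += 1
      if child.tag == 'p'
        # Sibling paragraphs are pooled into one group: (parent path, *).
        counts["(#{(path + ['*']).join(',')})"] += text_length(child)
      else
        tally(child, path + [index], counts)
      end
    end
  end
  counts
end

# Cut-down version of the sample page above.
body = el('body',
  el('div',
    el('ul',
      el('li', el('a', 'Home')),
      el('li', el('a', 'Politics')))),
  el('div',
    el('div',
      el('p', el('b', 'MIAMI, Florida (CNN) '), '-- Hurricane Ike weakened slightly'),
      el('p', 'Ike hit Turks and Caicos Islands Sunday morning')),
    el('div',
      el('p', "Some side-story that we don't really care about."))),
  el('div', '© 2008 Cable News Network.'))

counts = tally(body)
best_location, best_size = counts.max_by { |_, size| size }
puts best_location # (2,1,*)
```

Picking the single largest group is the simplest possible choice. On real pages you would probably also want to ignore very short groups or weigh link-heavy text lower, but that goes beyond this sketch.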