Question

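I'm trying to parse a website. Here is what I'm doing: I download the source and use Nokogiri to traverse the data and extract the information I need, such as links and content. I already have a script that fetches the data. However, I've run into a problem: some links only work when you click them on the actual website.

Here is a sample of the source I'm trying to traverse.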

<div class="story-item-content group">
<div class="story-item-details">
  <h3 class="story-item-title">
    <a href="/story/r/how_not_to_fix_your_computer_part_2" target="_blank" class="external-link ">How NOT to fix your computer, part 2.</a>
    <span class="external-link-icon"></span>                                            
    </h3>
    <p class="story-item-description">
         <a href="/search?q=site:zug.com" class="story-item-source" title="More stories from zug.com">zug.com</a>                            <a href="/news/technology/how_not_to_fix_your_computer_part_2" class="story-item-teaser">&mdash; After you read this you should understand what not to do.
        <span class="timestamp">21 hr 59 min ago</span></a>
        <a class="crawl4link" href="http://crawl4.digg.internal/permalink/view/how_not_to_fix_your_computer_part_2">View in Crawl 4</a>
    </p>
</div>

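So on line 4, the link href="/story/r/how_not_to_fix_your_computer_part_2"

only works on the actual website. When I download the source and click the link, it doesn't work. I'm guessing the link is stored on the server. Any idea how to get the full link? I'd like a script that clicks the link so that I end up with a working link. Any idea how to do that? Thanks.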

Answer 1

The URL is a relative URL.

So, if the site you are on is:

http://mywebsite.com/index.html

then your full link is:

http://mywebsite.com/story/r/how_not_to_fix_your_computer_part_2
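As a minimal sketch of that resolution in Ruby, the standard library's `URI.join` can combine the page URL with the href the way a browser would. The helper name `resolve_link` is hypothetical, and the page URL is just the example address used above.

```ruby
require "uri"

# Hypothetical helper: resolve an href against the URL of the page it came from.
def resolve_link(page_url, href)
  URI.join(page_url, href).to_s
end

puts resolve_link(
  "http://mywebsite.com/index.html",
  "/story/r/how_not_to_fix_your_computer_part_2"
)
# prints http://mywebsite.com/story/r/how_not_to_fix_your_computer_part_2
```

Because the href starts with a slash, it replaces the whole path of the page URL; an href without a leading slash would instead be resolved against the page's directory.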

Answer 2

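It is a relative link, relative to the root of the website. Just prepend the domain name (i.e. example.com/story/r/how_not_to_fix_your_computer_part_2).

Clicking the link fails because the href value is resolved relative to wherever the page is loaded from. Once the page has been downloaded to your local machine, it is no longer tied to the original domain, so the browser will assume it should look for a file at something like http://localhost/story/r/how_not_to_fix_your_computer_part_2. Since there is no file or resource at that URL, it fails.

What you need to do is turn the href value into an absolute URL by adding the original domain (i.e. digg.com/story/r/how_not_to_fix_your_computer_part_2). Then it will work when you click it from your local drive.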
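A rough sketch of that step for a whole list of links follows. It assumes the page came from `http://digg.com/` (the scheme is a guess, since the answer only names the domain), and `absolutize_all` is a hypothetical helper. In a Nokogiri script the hrefs would typically be collected with something like `doc.css("a").map { |a| a["href"] }`; here they are hard-coded from the sample so the example runs with the standard library alone.

```ruby
require "uri"

# Hypothetical helper: turn every href into an absolute URL.
# The default base is an assumption about where the page was downloaded from.
# Hrefs that are already absolute come back unchanged from URI.join.
def absolutize_all(hrefs, base = "http://digg.com/")
  hrefs.map { |href| URI.join(base, href).to_s }
end

# Hrefs copied from the sample markup in the question.
hrefs = [
  "/story/r/how_not_to_fix_your_computer_part_2",
  "/news/technology/how_not_to_fix_your_computer_part_2",
  "http://crawl4.digg.internal/permalink/view/how_not_to_fix_your_computer_part_2"
]

absolutize_all(hrefs).each { |url| puts url }
```

Using `URI.join` rather than plain string concatenation avoids doubled or missing slashes and leaves links that already carry their own domain untouched.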

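You don't need to worry about the numbers appended to the URL when it finally resolves; that is handled by the resource located at the digg.com/story/r/how_not_to_fix_your_computer_part_2 URL.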

How to parse a website and get information
