Question

我目前正在一家专注于新闻网站的网络抓取工作。但是它花了太长时间，我相信它是因为脚本当时打开一页，废弃它，然后转到另一页。因此，有时间发送请求，从服务器获取响应以及所有这些。有没有一种方法可以让我每次打开超过一页并最大限度地利用我的带宽？

<div class="slider valign-wrapper">
    <ul class="slides">
      <li>
        <img src="http://lorempixel.com/580/250/nature/1"> <!-- random image -->
        <div class="valign caption center-align">
          <h3>This is our big Tagline!</h3>
          <h5 class="light grey-text text-lighten-3">Here's our small slogan.</h5>
        </div>
      </li>
      <li>
        <img src="http://lorempixel.com/580/250/nature/2"> <!-- random image -->
        <div class="valign caption left-align">
          <h3>Left Aligned Caption</h3>
          <h5 class="light grey-text text-lighten-3">Here's our small slogan.</h5>
        </div>
      </li>
      <li>
        <img src="http://lorempixel.com/580/250/nature/3"> <!-- random image -->
        <div class="valign caption right-align">
          <h3>Right Aligned Caption</h3>
          <h5 class="light grey-text text-lighten-3">Here's our small slogan.</h5>
        </div>
      </li>
      <li>
        <img src="http://lorempixel.com/580/250/nature/4"> <!-- random image -->
        <div class="valign caption center-align">
          <h3>This is our big Tagline!</h3>
          <h5 class="light grey-text text-lighten-3">Here's our small slogan.</h5>
        </div>
      </li>
    </ul>
  </div>

Answer 1

你的直觉是正确的。您的代码一次处理一个页面，而您没有充分利用所有带宽。您可能想要使用评论中提到的某些特定库（scrappy.org）或查看线程。 https://docs.python.org/2/library/threading.html

请注意，使用线程并不像在多个线程中启动代码那么简单。您必须协调他们访问articles列表的方式。您将需要使用线程安全的东西。 https://docs.python.org/2/library/queue.html

最大化Web爬网程序中的带宽使用

1 个答案: