Question

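I'm trying to write a fast HTML scraper, and at this point I'm only focusing on maximizing my throughput, without parsing. I have cached the IP addresses of the URLs: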

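import java.net.InetAddress;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.UnknownHostException;
import java.util.ArrayList;

public class Data {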
    private static final ArrayList<String> sites = new ArrayList<String>();
    public static final ArrayList<URL> URL_LIST = new ArrayList<URL>();
    public static final ArrayList<InetAddress> ADDRESSES = new ArrayList<InetAddress>();

    static{
        /*
        add all the URLs to the sites array list
        */

        // Resolve the DNS prior to testing the throughput 
        for(int i = 0; i < sites.size(); i++){

            try {
                URL tmp = new URL(sites.get(i));
                InetAddress address = InetAddress.getByName(tmp.getHost());
                ADDRESSES.add(address);
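                // Note: requesting by IP drops the original host name, so name-based virtual hosts may not serve the expected page
                URL_LIST.add(new URL("http", address.getHostAddress(), tmp.getPort(), tmp.getFile()));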
                System.out.println(tmp.getHost() + ": " + address.getHostAddress());
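            } catch (MalformedURLException e) {
                // Skip entries that are not valid URLs, but report them
                System.err.println("Bad URL " + sites.get(i) + ": " + e.getMessage());
            } catch (UnknownHostException e) {
                // Skip hosts that fail DNS resolution, but report them
                System.err.println("Unresolved host: " + e.getMessage());
            }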
        }
    }
}

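My next step was to test the speed with 100 URLs by fetching them from the internet, reading the first 64 KB and moving on to the next URL. I create a thread pool of FetchTaskConsumer instances, and I've tried running multiple threads (16 to 64 on an i7 quad-core machine). Here is what each consumer looks like: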

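import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.util.concurrent.CountDownLatch;

public class FetchTaskConsumer implements Runnable{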
    private final CountDownLatch latch;
    private final int[] urlIndexes;
    public FetchTaskConsumer (int[] urlIndexes, CountDownLatch latch){
        this.urlIndexes = urlIndexes;
        this.latch = latch;
    }

    @Override
    public void run() {

        URLConnection resource;
        InputStream is = null;
        for(int i = 0; i < urlIndexes.length; i++)
        {
            int numBytes = 0;
            try {                   
                resource = Data.URL_LIST.get(urlIndexes[i]).openConnection();

                resource.setRequestProperty("User-Agent", "Mozilla/5.0");

                is = resource.getInputStream();

                // Check the limit first so that no byte is read beyond the 64 KB cap
                while(numBytes < 65536 && is.read()!=-1)
                {
                    numBytes++;
                }

            } catch (IOException e) {
                System.out.println("Fetch Exception: " + e.getMessage());
            } finally {

                // `remaining` is a shared counter declared elsewhere (not shown here)
                System.out.println(numBytes + " bytes for url index " + urlIndexes[i] + "; remaining: " + remaining.decrementAndGet());
                if(is!=null){
                    try {
                        is.close();
                    } catch (IOException e1) {/*eat it*/}
                }
            }
        }

        latch.countDown();
    }
}
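
The loop above pulls the response one byte per `read()` call. As a minimal sketch of the same "first 64 KB only" idea using chunked reads, the helper below reads into a byte buffer and stops at a limit. The class name `BoundedRead`, the method `readUpTo`, and the 8 KB buffer size are illustrative assumptions, not part of the original code, and no claim is made here about how much this changes the measured throughput.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BoundedRead {
    // Reads at most `limit` bytes from `in` in chunks and returns the number read.
    // Hypothetical helper: the name and the buffer size are assumptions.
    static int readUpTo(InputStream in, int limit) throws IOException {
        byte[] buf = new byte[8192];
        int total = 0;
        while (total < limit) {
            // Never ask for more than what is left under the limit
            int n = in.read(buf, 0, Math.min(buf.length, limit - total));
            if (n == -1) {
                break; // end of stream reached before the limit
            }
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for the HTTP response body
        InputStream big = new ByteArrayInputStream(new byte[100000]);
        System.out.println(readUpTo(big, 65536) + " bytes read");
    }
}
```

In the consumer, a call such as `numBytes = BoundedRead.readUpTo(is, 65536);` could stand in for the `while` loop.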

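At best I can get through 100 URLs in about 30 seconds, but the literature suggests I should be able to get through ~~300~~ 150 URLs per second. Note that I have access to Gigabit Ethernet, although I'm currently running the test at home on my 20 Mbit connection... in either case, the connection is never really fully utilized.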

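I've tried using Socket connections directly, but I must be doing something wrong, because that is even slower! Any suggestions on how I can improve the throughput?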

P.S.
I have a list of about 1 million popular URLs, so I can add more URLs if 100 is not enough for a benchmark.

Update
The literature I'm referring to is the set of papers on the Najork web crawler. Najork states:

Processed 891 million URLs in 17 days. That is 606 downloads per second [on] 4 Compaq DS20E Alpha servers [with] 4 GB main memory[,] 650 GB disk space [and] 100 MBit/sec Ethernet. ISP rate-limits bandwidth to 160 Mbits/sec

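So it's actually 150 pages per second, not 300. My computer is a Core i7 with 4 GB of RAM, and I'm nowhere near that. I didn't see anything stating what they used in particular.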

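Update
OK, the final results are in! It turns out that 100 URLs is a bit too low for a benchmark. I bumped it up to 1024 URLs and 64 threads, set a timeout of 2 seconds for each fetch, and was able to get up to 21 pages per second (in reality my connection is around 10.5 Mbps, so 21 pages per second * 64 KB per page is about 10.5 Mbps). Here is what the fetcher looks like: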

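import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.util.concurrent.CountDownLatch;

public class FetchTask implements Runnable{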
    private final int timeoutMS = 2000;
    private final CountDownLatch latch;
    private final int[] urlIndexes;
    public FetchTask(int[] urlIndexes, CountDownLatch latch){
        this.urlIndexes = urlIndexes;
        this.latch = latch;
    }

    @Override
    public void run() {

        URLConnection resource;
        InputStream is = null;
        for(int i = 0; i < urlIndexes.length; i++)
        {
            int numBytes = 0;
            try {                   
                resource = Data.URL_LIST.get(urlIndexes[i]).openConnection();

                // Bounds connection setup only; a stalled read is not covered by this timeout
                resource.setConnectTimeout(timeoutMS);

                resource.setRequestProperty("User-Agent", "Mozilla/5.0");

                is = resource.getInputStream();

                // Check the limit first so that no byte is read beyond the 64 KB cap
                while(numBytes < 65536 && is.read()!=-1)
                {
                    numBytes++;
                }

            } catch (IOException e) {
                System.out.println("Fetch Exception: " + e.getMessage());
            } finally {

                // `remaining` is a shared counter declared elsewhere (not shown here)
                System.out.println(numBytes + "," + urlIndexes[i] + "," + remaining.decrementAndGet());
                if(is!=null){
                    try {
                        is.close();
                    } catch (IOException e1) {/*eat it*/}
                }
            }
        }

        latch.countDown();
    }
}
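
The code that hands URL indexes to the tasks is not shown in the question. As a rough sketch of one way to do it, the driver below splits the indexes into one slice per worker, runs the slices on a fixed thread pool, and waits on a `CountDownLatch`. The class `FetchDriverSketch`, the methods `partition` and `runAll`, and the stand-in task body are all assumptions made for illustration; in the real benchmark each slice would be passed to a `FetchTask`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class FetchDriverSketch {
    // Splits indexes 0..urlCount-1 into `workers` contiguous slices of near-equal size.
    static List<int[]> partition(int urlCount, int workers) {
        List<int[]> slices = new ArrayList<>();
        int base = urlCount / workers;
        int extra = urlCount % workers; // the first `extra` slices get one more index
        int next = 0;
        for (int w = 0; w < workers; w++) {
            int[] slice = new int[base + (w < extra ? 1 : 0)];
            for (int j = 0; j < slice.length; j++) {
                slice[j] = next++;
            }
            slices.add(slice);
        }
        return slices;
    }

    // Runs one task per slice and returns how many indexes were processed.
    static int runAll(int urlCount, int workers) throws InterruptedException {
        List<int[]> slices = partition(urlCount, workers);
        CountDownLatch latch = new CountDownLatch(slices.size());
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int[] slice : slices) {
            pool.execute(() -> {
                try {
                    // Stand-in for the real fetch loop over the slice
                    processed.addAndGet(slice.length);
                } finally {
                    latch.countDown();
                }
            });
        }
        latch.await();   // wait until every slice has finished
        pool.shutdown();
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runAll(1024, 64) + " indexes processed");
    }
}
```

With the figures from the update, `partition(1024, 64)` yields 64 slices of 16 indexes each.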

Answer 1

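Are you sure about your sums?

300 URLs per second, each URL reading 64 kilobytes

That requires: 300 x 64 = 19,200 kilobytes/sec

Converting to bits: 19,200 kilobytes/sec = (8 * 19,200) kilobits/sec

So we have: 8 * 19,200 = 153,600 kilobits/sec

Then to Mb/s: 153,600 / 1024 = 150 megabits/sec

...and yet you only have a 20 Mb/s channel.

However, I imagine that many of the URLs you are fetching are under 64 KB in size, so the throughput appears faster than your channel. You're not slow, you're fast!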
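
The same arithmetic can be written as a small sketch, which also lets it be run in the other direction (how many 64 KB pages per second a given link can carry). The class and method names are made up for illustration, and the conversion follows this answer's convention of 1024 kilobits per megabit.

```java
public class ThroughputMath {
    // Bandwidth in megabits/sec needed to fetch `pagesPerSecond` pages of `kbPerPage` kilobytes each.
    // Follows the answer's convention: 8 bits per byte, 1024 kilobits per megabit.
    static double requiredMbps(double pagesPerSecond, double kbPerPage) {
        double kilobytesPerSecond = pagesPerSecond * kbPerPage;
        double kilobitsPerSecond = kilobytesPerSecond * 8;
        return kilobitsPerSecond / 1024.0;
    }

    // Upper bound on full-size pages per second that a link of `linkMbps` megabits/sec can carry.
    static double maxPagesPerSecond(double linkMbps, double kbPerPage) {
        double kilobytesPerSecond = linkMbps * 1024.0 / 8;
        return kilobytesPerSecond / kbPerPage;
    }

    public static void main(String[] args) {
        System.out.println("Mb/s needed for 300 pages/s: " + requiredMbps(300, 64));
        System.out.println("Pages/s on a 20 Mb/s link:   " + maxPagesPerSecond(20, 64));
        System.out.println("Pages/s on a 10.5 Mb/s link: " + maxPagesPerSecond(10.5, 64));
    }
}
```

Under these assumptions the three values come out to 150 Mb/s, 40 pages/s and 21 pages/s; the last matches the figure reported in the question's final update.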

Answer 2

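This time focusing on your achievements. I tried your code myself and found that I was getting around 3 pages per second when accessing major sites. If I accessed my own web server and downloaded static pages, though, I maxed out my system.

On the internet today, a major site typically takes over a second to generate a page. Looking at the packets being sent to me just now, a page arrives in multiple TCP/IP packets. From here in the UK it takes 3 seconds to download www.yahoo.co.jp and 2 seconds to download amazon.com, but facebook.com takes less than 0.1 seconds. The difference is that the facebook.com front page is static, while the other two are dynamic. For humans, the critical factor is the time to first byte, i.e. the time at which the browser can start doing something, not the time to the 65536th byte. No one optimizes that :-)

So what does this mean for you? Since you are focusing on popular pages, I guess you are also focusing on dynamic pages, which simply aren't sent as fast as static pages. Because the sites I looked at send multiple packets for their pages, this means that if you fetch several pages at the same time, the packets could collide with each other on the Ethernet.

Packet collisions occur when two websites send you packets at the same time. At some point the input from the two sites has to be coordinated onto the single wire to your computer. When two packets land on top of each other, the router combining them rejects both and instructs both senders to resend after a different short delay. Effectively, this slows both sites down.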

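So:

1) Pages are not generated quickly these days.
2) Ethernet has problems coping with multiple simultaneous downloads.
3) Static websites (which used to be more common) are much faster than dynamic ones and use fewer packets.

This all means that maxing out your connection is very difficult.

You could try the same test with 1000 64 KB files put in place, and see how fast your code downloads them. To me, your code is working fine.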
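
As a rough sketch of how the test data for that experiment could be prepared, the helper below writes a number of fixed-size files into a directory that a web server could then serve. The class name `TestFileGenerator`, the `page<N>.html` naming scheme, and the filler content are assumptions for illustration; how the directory is served is left open.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class TestFileGenerator {
    // Writes `count` files of exactly `sizeBytes` bytes into `dir` and returns the count.
    // File names (page0.html, page1.html, ...) are a made-up convention.
    static int generate(Path dir, int count, int sizeBytes) throws IOException {
        Files.createDirectories(dir);
        byte[] body = new byte[sizeBytes];
        Arrays.fill(body, (byte) 'x'); // filler content; only the size matters here
        for (int i = 0; i < count; i++) {
            Files.write(dir.resolve("page" + i + ".html"), body);
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory keeps the sketch self-contained
        Path dir = Files.createTempDirectory("fetch-bench");
        generate(dir, 1000, 64 * 1024);
        System.out.println("Wrote 1000 files of 64 KB each to " + dir);
    }
}
```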

Fastest way to fetch multiple web pages in Java