Reading from different remote sources: how to optimize threads?

Time: 2014-05-25 15:05:04

Tags: java multithreading remote-access

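I need to read a large number of web pages from the network. Below is the method I currently use to fetch a remote web page. Note that the current code is 100% working.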

    static private GetWebPageResult getWebPage(PageNode pagenode)
{
    String result;
    String inputLine;
    URI url;
    int cicliLettura=0;
    long startTime=0, endTime, openConnTime=0,connTime=0, readTime=0;
    try
    {
        startTime=System.nanoTime();
        result="";
        url=pagenode.getUri();      // TODO: handle the case where getUri() returns null
        if(Core.logGetWebPage())
            openConnTime=System.nanoTime();
        if(url!=null)
        {
            HttpURLConnection yc = (HttpURLConnection) url.toURL().openConnection(); // TODO: check yc
            if(url.toURL().getProtocol().equalsIgnoreCase("https"))
                yc=(HttpsURLConnection)yc;
            yc.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB;     rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)"); 
            yc.connect();           // connect() returns nothing to check; at worst it throws IOException
            if(checkResponseCode(yc.getResponseCode())==false)
                return new GetWebPageResult(GetWebPageResult.ERR_BAD_RESPONSE_CODE,yc.getResponseCode());
            if(Core.logGetWebPage())
                connTime=System.nanoTime();

            BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream())); // may throw IOException
            /*
            while ((inputLine = in.readLine()) != null)
            {
                result=result+inputLine+"\n";
                cicliLettura++;
            }*/
            StringBuffer buffer = new StringBuffer();
            while ((inputLine = in.readLine()) != null)
            {
                buffer.append(inputLine).append('\n');
                cicliLettura++;
            }
            result = buffer.toString();

            if(Core.logGetWebPage())
                readTime=System.nanoTime();
            in.close();
            yc.disconnect();
            if(Core.logGetWebPage())
            {
                endTime=System.nanoTime();
                        // url is not null here; it was checked above
                System.out.println(/*result+*/"getWebPage eseguito in "+(endTime-startTime)/1000000+" ms. Size: "+result.length()+" Response Code="+yc.getResponseCode()+" Protocollo="+url.toURL().getProtocol()+" openConnTime: "+(openConnTime-startTime)/1000000+" connTime:"+(connTime-openConnTime)/1000000+" readTime:"+(readTime-connTime)/1000000+" cicliLettura="+cicliLettura+" pagina:"+url.toURL());
            }
            return new GetWebPageResult(result);
        }
        else
            return new GetWebPageResult(GetWebPageResult.ERR_NULL_URI,-2);
    }catch(IOException e){
        System.out.println("Eccezione1: "+e.toString());
        e.printStackTrace();  
        return new GetWebPageResult(GetWebPageResult.ERR_HTML_IOEXCEPTION,-2);
    }catch(ClassCastException e){
        System.out.println("Eccezione2: "+e.toString());
        e.printStackTrace(); 
        return new GetWebPageResult(GetWebPageResult.ERR_CLASS_CAST_EXC,-2);
    }catch(Exception e){
        System.out.println("Eccezione ERR_NOT_LISTED_EXC: "+e.toString());
        return new GetWebPageResult(GetWebPageResult.ERR_NOT_LISTED_EXC,-2);
    }
}

Given that url is not null, let's take a closer look at the code:

HttpURLConnection yc = (HttpURLConnection) url.toURL().openConnection(); // TODO: check yc
            if(url.toURL().getProtocol().equalsIgnoreCase("https"))
                yc=(HttpsURLConnection)yc;
            yc.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB;     rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)"); 
            yc.connect();           // connect() returns nothing to check; at worst it throws IOException
            if(checkResponseCode(yc.getResponseCode())==false)
                return new GetWebPageResult(GetWebPageResult.ERR_BAD_RESPONSE_CODE,yc.getResponseCode());

What is the difference between the .openConnection and .connect methods? In any case, once the connection is open, we start reading the data:

                BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream())); // may throw IOException
            StringBuffer buffer = new StringBuffer();
            while ((inputLine = in.readLine()) != null)
            {
                buffer.append(inputLine).append('\n');
                cicliLettura++;
            }
            result = buffer.toString();

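OK, now I have a BufferedReader that I can read the data from. The problem is that my bandwidth is usually much greater than the remote machine's, so I would like to be able to read from different sources "at the same time". A good approach seems to be to start many threads and change the last part of the code to something like this:

While not at end of file: is there a complete line to read? If yes, read the line; otherwise sleep for a little while. At that point I move on to the next reading thread and do the same thing. Is this right? How can I implement it?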

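1 answer:

Answer 0: (score: 1)

This looks like a classic producer/consumer scenario. You can optimize your application by creating the classes below. If you are not yet familiar with BlockingQueue and the producer-consumer problem, I suggest you read this before continuing with my answer/design.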

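  1. WebPageResult: a POJO representing the content of a web page. It holds a StringBuffer with the page's content and the page's name/URL, which identifies the page the content belongs to.
  2. ProcessingQueue: a singleton class with an ArrayBlockingQueue that holds PageNode objects, plus methods to add PageNodes to this queue and poll them from it.
  3. ResultQueue: a singleton class with an ArrayBlockingQueue that holds WebPageResult objects, plus methods to add WebPageResults to this queue and poll them from it.
  4. WebPageReader: implements Runnable. In a while(true) loop inside its run method, it calls the poll method on ProcessingQueue and reads the content of the PageNode it polled. The content read from the PageNode should be wrapped in a WebPageResult and put into ResultQueue by calling ResultQueue's add method.
  5. WebPageProcessor: implements Runnable. In a while(true) loop inside its run method, it calls the poll method on ResultQueue and then does whatever is needed with the content.
  6. Now all you need to do is add PageNode objects to ProcessingQueue, start the WebPageReader and WebPageProcessor threads, and watch the magic happen. Let me know if you need any clarification. Depending on your requirements, you can start just one WebPageReader thread and one WebPageProcessor thread, or several of each; the design supports both. You can also introduce another thread that adds PageNode objects to ProcessingQueue, by crawling the web or by polling some kind of database for pages to crawl.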
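
The design above could be sketched roughly as follows. This is a minimal illustration, not the answerer's actual code: the class name `CrawlerSketch`, the `crawl` method, the stop markers, and the queue capacity of 100 are all assumptions made for the example. PageNode is replaced by a plain URL string, the result holds a String rather than a StringBuffer, and the real download is replaced by a caller-supplied `fetcher` function so the sketch runs without a network. It also uses the blocking `take()`/`put()` calls instead of `poll()`/`add()`, so idle threads wait on the queue rather than spinning in the while(true) loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

public class CrawlerSketch {
    // Simplified stand-in for the WebPageResult POJO from the answer.
    static final class WebPageResult {
        final String url;
        final String content;
        WebPageResult(String url, String content) { this.url = url; this.content = content; }
    }

    // Hypothetical stop markers ("poison pills") that tell the workers to finish.
    private static final String STOP_URL = "\u0000STOP";
    private static final WebPageResult STOP_RESULT = new WebPageResult(STOP_URL, "");

    // Fetches all urls with `readers` reader threads and one processor thread.
    // `fetcher` is an assumed placeholder for the real getWebPage logic.
    static List<WebPageResult> crawl(List<String> urls, int readers,
                                     Function<String, String> fetcher) throws InterruptedException {
        if (readers < 1) throw new IllegalArgumentException("need at least one reader");
        BlockingQueue<String> processingQueue = new ArrayBlockingQueue<>(100);
        BlockingQueue<WebPageResult> resultQueue = new ArrayBlockingQueue<>(100);
        List<WebPageResult> processed = new ArrayList<>(); // touched only by the processor thread

        // WebPageReader role: take a URL, fetch it, hand the result on.
        Runnable reader = () -> {
            try {
                while (true) {
                    String url = processingQueue.take(); // blocks until work arrives
                    if (url.equals(STOP_URL)) {
                        resultQueue.put(STOP_RESULT);    // tell the processor this reader is done
                        return;
                    }
                    String content;
                    try {
                        content = fetcher.apply(url);
                    } catch (RuntimeException e) {
                        content = "ERROR: " + e;         // keep the reader alive on a failed fetch
                    }
                    resultQueue.put(new WebPageResult(url, content));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        // WebPageProcessor role: consume results until every reader has signalled completion.
        Runnable processor = () -> {
            int finishedReaders = 0;
            try {
                while (finishedReaders < readers) {
                    WebPageResult r = resultQueue.take();
                    if (r == STOP_RESULT) finishedReaders++;
                    else processed.add(r);               // "do whatever is needed" goes here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < readers; i++) threads.add(new Thread(reader));
        threads.add(new Thread(processor));
        for (Thread t : threads) t.start();

        // Producer role: enqueue the work, then one stop marker per reader.
        for (String url : urls) processingQueue.put(url);
        for (int i = 0; i < readers; i++) processingQueue.put(STOP_URL);
        for (Thread t : threads) t.join();               // join makes `processed` safe to read
        return processed;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> urls = Arrays.asList("http://example.com/a", "http://example.com/b");
        // Fake fetcher for the demo; a real one would wrap the HttpURLConnection code.
        List<WebPageResult> results = crawl(urls, 2, url -> "<html>" + url + "</html>");
        System.out.println("Fetched pages: " + results.size());
    }
}
```

With several readers the order of the results is not deterministic, which is why each result carries its URL. Because most of the time is spent waiting on slow remote servers, the number of reader threads can usually be well above the number of CPU cores; the right value is something to measure rather than assume.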