I need to read a large number of web pages. This is the method I actually use to fetch a remote page. Note that the current code works 100%.
static private GetWebPageResult getWebPage(PageNode pagenode)
{
    String result;
    String inputLine;
    URI url;
    int cicliLettura = 0;
    long startTime = 0, endTime, openConnTime = 0, connTime = 0, readTime = 0;
    try
    {
        startTime = System.nanoTime();
        result = "";
        url = pagenode.getUri(); // TODO: do something if getUri() returns null
        if (Core.logGetWebPage())
            openConnTime = System.nanoTime();
        if (url != null)
        {
            HttpURLConnection yc = (HttpURLConnection) url.toURL().openConnection(); // check yc
            if (url.toURL().getProtocol().equalsIgnoreCase("https"))
                yc = (HttpsURLConnection) yc;
            yc.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)");
            yc.connect(); // check the result of connect() => there is none! at most it throws an IOException
            if (checkResponseCode(yc.getResponseCode()) == false)
                return new GetWebPageResult(GetWebPageResult.ERR_BAD_RESPONSE_CODE, yc.getResponseCode());
            if (Core.logGetWebPage())
                connTime = System.nanoTime();
            BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream())); // may throw an IOException
            /* previous version, building the string by concatenation:
            while ((inputLine = in.readLine()) != null)
            {
                result = result + inputLine + "\n";
                cicliLettura++;
            }*/
            StringBuffer buffer = new StringBuffer();
            while ((inputLine = in.readLine()) != null)
            {
                buffer.append(inputLine).append('\n');
                cicliLettura++;
            }
            result = buffer.toString();
            if (Core.logGetWebPage())
                readTime = System.nanoTime();
            in.close();
            yc.disconnect();
            if (Core.logGetWebPage())
            {
                endTime = System.nanoTime();
                // url.toURL() is not null, checked above
                System.out.println(/*result+*/"getWebPage executed in " + (endTime - startTime) / 1000000 + " ms. Size: " + result.length() + " Response Code=" + yc.getResponseCode() + " Protocol=" + url.toURL().getProtocol() + " openConnTime: " + (openConnTime - startTime) / 1000000 + " connTime:" + (connTime - openConnTime) / 1000000 + " readTime:" + (readTime - connTime) / 1000000 + " cicliLettura=" + cicliLettura + " page:" + url.toURL());
            }
            return new GetWebPageResult(result);
        }
        else
            return new GetWebPageResult(GetWebPageResult.ERR_NULL_URI, -2);
    } catch (IOException e) {
        System.out.println("Exception1: " + e.toString());
        e.printStackTrace();
        return new GetWebPageResult(GetWebPageResult.ERR_HTML_IOEXCEPTION, -2);
    } catch (ClassCastException e) {
        System.out.println("Exception2: " + e.toString());
        e.printStackTrace();
        return new GetWebPageResult(GetWebPageResult.ERR_CLASS_CAST_EXC, -2);
    } catch (Exception e) {
        System.out.println("Exception ERR_NOT_LISTED_EXC: " + e.toString());
        return new GetWebPageResult(GetWebPageResult.ERR_NOT_LISTED_EXC, -2);
    }
}
Given that url is not null, let's look more closely at this part of the code:
HttpURLConnection yc = (HttpURLConnection) url.toURL().openConnection(); // check yc
if (url.toURL().getProtocol().equalsIgnoreCase("https"))
    yc = (HttpsURLConnection) yc;
yc.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)");
yc.connect(); // check the result of connect() => there is none! at most it throws an IOException
if (checkResponseCode(yc.getResponseCode()) == false)
    return new GetWebPageResult(GetWebPageResult.ERR_BAD_RESPONSE_CODE, yc.getResponseCode());
What is the difference between the openConnection() and connect() methods? In any case, once the connection is open, we start reading the data:
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream())); // may throw an IOException
StringBuffer buffer = new StringBuffer();
while ((inputLine = in.readLine()) != null)
{
    buffer.append(inputLine).append('\n');
    cicliLettura++;
}
result = buffer.toString();
OK, so now I have a BufferedReader I can read from. The problem is that my bandwidth is usually much larger than the bandwidth of the remote machine, so I would like to read from several sources at the same time. A good approach seems to be to start many threads and change the last part of the code to something like this:
while we are not at the end of the stream, is there a complete line to read? If yes, read the line; otherwise sleep a bit. At that point I move on to the next reading thread and do the same thing. Is this right? How can I implement it?
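In code, what I have in mind is roughly the following (just a sketch of the idea, not working code from my project; pollReaders, openReaders and buffers are names invented for this example):

import java.io.BufferedReader;
import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// One loop cycles over all the open readers, consumes whatever is ready
// without blocking for long, and backs off when nothing is available.
static void pollReaders(List<BufferedReader> openReaders,
                        Map<BufferedReader, StringBuffer> buffers)
        throws IOException, InterruptedException
{
    while (!openReaders.isEmpty())
    {
        boolean readSomething = false;
        Iterator<BufferedReader> it = openReaders.iterator();
        while (it.hasNext())
        {
            BufferedReader in = it.next();
            // ready() only says that some input is buffered, not that a whole
            // line is available, so readLine() can still block briefly.
            if (in.ready())
            {
                String line = in.readLine();
                if (line == null)
                {
                    in.close();  // end of this stream: drop the reader
                    it.remove();
                }
                else
                {
                    buffers.get(in).append(line).append('\n');
                    readSomething = true;
                }
            }
        }
        if (!readSomething)
            Thread.sleep(20); // nothing was ready on any source: sleep a bit and retry
    }
}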
Answer (score: 1)
This looks like a classic producer/consumer scenario. You can optimize your application by creating the classes described below. If you are not yet familiar with the concept of a BlockingQueue and the producer-consumer problem, I suggest you read this before continuing with my answer/design.
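As a quick self-contained illustration of the pattern (the queue contents and the thread bodies here are placeholders, not taken from your code), a BlockingQueue lets a producer thread hand work to a consumer thread without any explicit locking:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ProducerConsumerDemo
{
    public static void main(String[] args)
    {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        // Producer: put() would block if the queue were bounded and full.
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 5; i++)
                    queue.put("item-" + i);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Consumer: take() blocks until an element is available.
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 5; i++)
                    System.out.println("consumed " + queue.take());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
    }
}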
Now all you need to do is add your PageNode objects to the ProcessingQueue, start the WebPageReader and WebPageProcessor threads, and watch the magic happen. Let me know if you need any clarification. Depending on your requirements, you can start just one WebPageReader thread and one WebPageProcessor thread, or several of each; the design supports both. You can also introduce another thread that adds PageNode objects to the ProcessingQueue, for example by crawling the web or by polling some kind of database for pages to crawl.
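To make the design concrete, here is a minimal sketch of what ProcessingQueue, WebPageReader and WebPageProcessor could look like, built on a BlockingQueue. The class bodies are my illustration, not code from your project: they assume your PageNode, GetWebPageResult and getWebPage() from the question (getWebPage() would need to be made reachable from the reader), and the queue sizes and stop conditions are left for you to adapt:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Holds the pages waiting to be downloaded and the results waiting to be processed.
class ProcessingQueue
{
    private final BlockingQueue<PageNode> pagesToFetch = new LinkedBlockingQueue<>();
    private final BlockingQueue<GetWebPageResult> fetchedPages = new LinkedBlockingQueue<>();

    public void addPage(PageNode node) throws InterruptedException { pagesToFetch.put(node); }
    public PageNode nextPage() throws InterruptedException { return pagesToFetch.take(); }
    public void addResult(GetWebPageResult r) throws InterruptedException { fetchedPages.put(r); }
    public GetWebPageResult nextResult() throws InterruptedException { return fetchedPages.take(); }
}

// Producer side: takes PageNodes from the queue, downloads them and hands the results on.
class WebPageReader implements Runnable
{
    private final ProcessingQueue queue;
    public WebPageReader(ProcessingQueue queue) { this.queue = queue; }

    public void run()
    {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                PageNode node = queue.nextPage();   // blocks until a page is available
                queue.addResult(getWebPage(node));  // your existing download method (make it accessible from here)
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();     // stop cleanly when interrupted
        }
    }
}

// Consumer side: takes downloaded pages and does whatever processing you need.
class WebPageProcessor implements Runnable
{
    private final ProcessingQueue queue;
    public WebPageProcessor(ProcessingQueue queue) { this.queue = queue; }

    public void run()
    {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                GetWebPageResult result = queue.nextResult(); // blocks until a result is available
                // ... parse/store the result here ...
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

// Wiring it together (again, just an illustration):
// ProcessingQueue queue = new ProcessingQueue();
// new Thread(new WebPageReader(queue)).start();     // one or more readers
// new Thread(new WebPageProcessor(queue)).start();  // one or more processors
// queue.addPage(somePageNode);

Note that if you give the queues a fixed capacity (for example new LinkedBlockingQueue<>(100)), put() also blocks when the queue is full, so the readers automatically slow down whenever the processors fall behind.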