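StormCrawler: Timeout waiting for connection from pool

Time: 2018-03-07 10:23:18

Tags: web-crawler stormcrawler

Whenever we increase the number of threads or the number of executors for the Fetcher bolt, we consistently get the following error.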

org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286) ~[stormjar.jar:?]
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263) ~[stormjar.jar:?]
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190) ~[stormjar.jar:?]
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184) ~[stormjar.jar:?]
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) ~[stormjar.jar:?]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:71) ~[stormjar.jar:?]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:220) ~[stormjar.jar:?]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:164) ~[stormjar.jar:?]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:139) ~[stormjar.jar:?]
at com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol.getProtocolOutput(HttpProtocol.java:206) ~[stormjar.jar:?]

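Is this caused by a resource leak, or by some hard limit on the size of the HTTP thread pool? If it is about the thread pool, is there a way to increase the pool size?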

1 Answer:

Answer 0 (score: 0)

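The pool in HttpProtocol sets its maximum number of connections to the number of fetch threads (fetcher.threads.number). Because the pool is static, it is shared by all the executors running on the same worker, so several executors together can ask for more connections than the pool holds. I'd recommend using a single FetcherBolt instance per worker. The number of threads then matches the pool size derived from fetcher.threads.number, and you should not run into this problem.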
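To illustrate the reasoning above, here is a minimal simulation using only the JDK. It does not use StormCrawler or Apache HttpClient: a `Semaphore` stands in for the static connection pool, and the class name `SharedPoolDemo`, the method `countTimeouts`, and the timing values are all hypothetical choices made for this sketch. It only shows why a pool sized for one executor's threads can be exhausted when two executors on the same worker share it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedPoolDemo {

    /**
     * Runs executors * threadsPerExecutor threads against one shared pool of
     * poolSize permits and returns how many threads timed out while waiting.
     */
    public static int countTimeouts(int poolSize, int executors, int threadsPerExecutor)
            throws InterruptedException {
        // Stand-in for the static connection pool shared by every executor on a worker.
        Semaphore pool = new Semaphore(poolSize);
        int total = executors * threadsPerExecutor;
        AtomicInteger timeouts = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(total);
        ExecutorService threads = Executors.newFixedThreadPool(total);

        for (int i = 0; i < total; i++) {
            threads.submit(() -> {
                try {
                    // Wait a bounded time for a connection, like a pool lease with a timeout.
                    if (pool.tryAcquire(100, TimeUnit.MILLISECONDS)) {
                        try {
                            Thread.sleep(500); // simulate a slow fetch holding the connection
                        } finally {
                            pool.release();
                        }
                    } else {
                        timeouts.incrementAndGet(); // analogous to the pool timeout in the question
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            });
        }
        done.await();
        threads.shutdown();
        return timeouts.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Pool sized for 10 threads, as if fetcher.threads.number were 10.
        System.out.println("1 executor:  " + countTimeouts(10, 1, 10) + " timeouts");
        System.out.println("2 executors: " + countTimeouts(10, 2, 10) + " timeouts");
    }
}
```

With one executor, every thread finds a free permit, so no timeouts should be reported. With two executors, twenty threads compete for ten permits, and the threads that cannot get one within the wait period give up, which mirrors the exception in the question.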

Alternatively, you could try the okhttp protocol. It is more robust for open, large-scale crawling. For a feature comparison, see the WIKI page on protocols.