StormCrawler seed injection fails on an AWS node

Time: 2019-11-13 09:52:06

Tags: stormcrawler

Seed injection (ESSeedInjector) with StormCrawler 1.15 fails on my AWS instance (ES 7.3.0, Storm 1.2.3), and I can't work out why. Essentially, every URL passed to the "enqueue" bolt fails.

Here is an excerpt from the Apache Storm worker log:

...
2019-11-13 09:38:18.937 o.a.s.d.executor Thread-16-enqueue-executor[4 4] [INFO] Processing received message FOR 4 TUPLE: source: filter:5, stream: status, id: {}, [http://aquaplex.pvi.com/, , DISCOVERED]
2019-11-13 09:38:18.937 o.a.s.d.executor Thread-16-enqueue-executor[4 4] [INFO] Execute done TUPLE source: filter:5, stream: status, id: {}, [http://aquaplex.pvi.com/, discoveryDate: 2019-11-13T09:38:18.937Z
, DISCOVERED] TASK: 4 DELTA: -1
2019-11-13 09:38:18.937 o.a.s.d.executor Thread-16-enqueue-executor[4 4] [INFO] Processing received message FOR 4 TUPLE: source: filter:5, stream: status, id: {}, [http://aquaponics.com/aquaponic-systems/com\
mercial-systems/, , DISCOVERED]
2019-11-13 09:38:18.937 o.a.s.d.executor Thread-16-enqueue-executor[4 4] [INFO] Execute done TUPLE source: filter:5, stream: status, id: {}, [http://aquaponics.com/aquaponic-systems/commercial-systems/, disc\
overyDate: 2019-11-13T09:38:18.937Z
, DISCOVERED] TASK: 4 DELTA: -1
2019-11-13 09:38:18.937 o.a.s.d.executor Thread-16-enqueue-executor[4 4] [INFO] Processing received message FOR 4 TUPLE: source: filter:5, stream: status, id: {}, [http://aquarium-fish.kamihata.net/guppy/, ,\
 DISCOVERED]
2019-11-13 09:38:18.935 c.d.s.e.p.StatusUpdaterBolt I/O dispatcher 2 [ERROR] Exception with bulk 1 - failing the whole lot
org.elasticsearch.ElasticsearchStatusException: Unable to parse response body
        at org.elasticsearch.client.RestHighLevelClient.parseResponseException(RestHighLevelClient.java:1707) ~[stormjar.jar:?]
        at org.elasticsearch.client.RestHighLevelClient$1.onFailure(RestHighLevelClient.java:1621) [stormjar.jar:?]
        at org.elasticsearch.client.RestClient$FailureTrackingResponseListener.onDefinitiveFailure(RestClient.java:564) [stormjar.jar:?]
        at org.elasticsearch.client.RestClient$1.completed(RestClient.java:310) [stormjar.jar:?]
        at org.elasticsearch.client.RestClient$1.completed(RestClient.java:294) [stormjar.jar:?]
        at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:122) [stormjar.jar:?]
        at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:181) [stormjar.jar:?]
        at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:448) [stormjar.jar:?]
        at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:338) [stormjar.jar:?]
        at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265) [stormjar.jar:?]
        at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81) [stormjar.jar:?]
        at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39) [stormjar.jar:?]
        at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114) [stormjar.jar:?]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162) [stormjar.jar:?]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337) [stormjar.jar:?]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315) [stormjar.jar:?]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276) [stormjar.jar:?]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) [stormjar.jar:?]
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591) [stormjar.jar:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_232]
Caused by: org.elasticsearch.client.ResponseException: method [POST], host [http://node-1], URI [/_bulk?timeout=1m], status line [HTTP/1.1 404 Not Found]
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /_bulk was not found on this server.</p>
</body></html>

        at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:253) ~[stormjar.jar:?]
        at org.elasticsearch.client.RestClient.access$900(RestClient.java:95) ~[stormjar.jar:?]
        at org.elasticsearch.client.RestClient$1.completed(RestClient.java:298) ~[stormjar.jar:?]
        ... 16 more
2019-11-13 09:38:18.938 o.a.s.d.executor Thread-16-enqueue-executor[4 4] [INFO] Execute done TUPLE source: filter:5, stream: status, id: {}, [http://aquarium-fish.kamihata.net/guppy/, discoveryDate: 2019-11-\
13T09:38:18.937Z
, DISCOVERED] TASK: 4 DELTA: -1
2019-11-13 09:38:18.938 o.a.s.d.executor Thread-16-enqueue-executor[4 4] [INFO] Processing received message FOR 4 TUPLE: source: filter:5, stream: status, id: {}, [http://aquarius-spectrum.com/pdf/Aquarius-S\
pectrum-PPT.pdf, , DISCOVERED]
2019-11-13 09:38:18.938 o.a.s.d.executor Thread-16-enqueue-executor[4 4] [INFO] Execute done TUPLE source: filter:5, stream: status, id: {}, [http://aquarius-spectrum.com/pdf/Aquarius-Spectrum-PPT.pdf, disco\
veryDate: 2019-11-13T09:38:18.938Z
, DISCOVERED] TASK: 4 DELTA: -1
2019-11-13 09:38:18.938 o.a.s.d.executor Thread-16-enqueue-executor[4 4] [INFO] Processing received message FOR 4 TUPLE: source: filter:5, stream: status, id: {}, [http://aquaservinc.com/contact-us/, , DISCO\
...

Has anyone run into the same problem?

1 answer:

Answer 0 (score: 0)

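Have you specified the port of the ES server correctly? Judging by the HTML 404 page in the stack trace, you are connecting to a plain HTTP server that knows nothing about _bulk.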
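One way to confirm this diagnosis is to look at what the configured address returns for its root URL. Elasticsearch answers with a JSON document that contains a `version` object, whereas a generic web server returns HTML. The sketch below is a heuristic only; the function name `looks_like_elasticsearch` and the sample `es_root` payload are made up for illustration, while `apache_404` is the response body from the log in the question.

```python
import json


def looks_like_elasticsearch(content_type: str, body: str) -> bool:
    """Heuristic check: does this response look like Elasticsearch's root document?"""
    # Elasticsearch replies with JSON; a plain web server usually sends HTML.
    if "json" not in content_type.lower():
        return False
    try:
        doc = json.loads(body)
    except ValueError:
        return False
    # The root document of an ES node carries a "version" object with a "number".
    version = doc.get("version") if isinstance(doc, dict) else None
    return isinstance(version, dict) and "number" in version


# Response body taken from the stack trace in the question.
apache_404 = """<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /_bulk was not found on this server.</p>
</body></html>"""

# Hypothetical, abbreviated example of what an ES 7.3.0 node might return for GET /.
es_root = '{"name": "node-1", "version": {"number": "7.3.0"}, "tagline": "You Know, for Search"}'

print(looks_like_elasticsearch("text/html; charset=iso-8859-1", apache_404))
print(looks_like_elasticsearch("application/json; charset=UTF-8", es_root))
```

In practice the content type and body would come from an HTTP request against the same host and port that the topology is configured to use.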

See the example ES conf. If no port is specified, the default port is used, so setting just the host should be fine.
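While troubleshooting, it can help to write the port out explicitly so there is no doubt about which endpoint receives the requests. The helper below is a minimal sketch of that idea, not StormCrawler's own address parsing: `with_explicit_port` is a hypothetical name, and 9200 is assumed because it is Elasticsearch's standard HTTP port.

```python
from urllib.parse import urlsplit

# Elasticsearch's standard HTTP port (assumed default for this sketch).
DEFAULT_ES_PORT = 9200


def with_explicit_port(address: str, default_port: int = DEFAULT_ES_PORT) -> str:
    """Return scheme://host:port, filling in the scheme and port when missing."""
    # A bare host such as "node-1" has no scheme; assume plain HTTP.
    if "://" not in address:
        address = "http://" + address
    parts = urlsplit(address)
    port = parts.port if parts.port is not None else default_port
    return f"{parts.scheme}://{parts.hostname}:{port}"


print(with_explicit_port("http://node-1"))
print(with_explicit_port("node-1"))
print(with_explicit_port("https://es.example.com:9243"))
```

The host `http://node-1` is the one shown in the log; the other inputs are illustrative. The sketch does not handle IPv6 literals.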