So I have two classes, one responsible for seeding (injecting the URLs) and one for crawling.
The ESSeedInjector class:
public class ESSeedInjector extends ConfigurableTopology {

    public static void main(String[] args) {
        ConfigurableTopology.start(new ESSeedInjector(),
                new String[] { ".", "seeds.txt", "-local", "-conf", "es-conf.yaml", "--sleep", "5000" });
    }

    @Override
    public int run(String[] args) {
        if (args.length == 0) {
            System.err.println("ESSeedInjector seed_dir file_filter");
            return -1;
        }
        conf.setDebug(true);
        TopologyBuilder builder = new TopologyBuilder();
        Scheme scheme = new StringTabScheme(Status.DISCOVERED);
        builder.setSpout("spout", new FileSpout(args[0], args[1], scheme));
        Fields key = new Fields("url");
        builder.setBolt("filter", new URLFilterBolt()).fieldsGrouping("spout", key);
        builder.setBolt("enqueue", new StatusUpdaterBolt(), 10)
                .customGrouping("filter", new URLStreamGrouping());
        return submit("ESSeedInjector", conf, builder);
    }
}
The crawler class:
public class ESCrawlTopology extends ConfigurableTopology {

    public static void main(String[] args) {
        ConfigurableTopology.start(new ESCrawlTopology(),
                new String[] { "-conf", "es-conf.yaml", "-local" });
    }

    @Override
    protected int run(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        int numWorkers = ConfUtils.getInt(getConf(), "topology.workers", 1);
        int numShards = 1;

        builder.setSpout("spout", new CollapsingSpout(), numShards);

        builder.setBolt("status_metrics", new StatusMetricsBolt())
                .shuffleGrouping("spout");
        builder.setBolt("partitioner", new URLPartitionerBolt(), numWorkers)
                .shuffleGrouping("spout");
        builder.setBolt("fetch", new FetcherBolt(), numWorkers)
                .fieldsGrouping("partitioner", new Fields("key"));
        builder.setBolt("sitemap", new SiteMapParserBolt(), numWorkers)
                .localOrShuffleGrouping("fetch");
        builder.setBolt("parse", new JSoupParserBolt(), numWorkers)
                .localOrShuffleGrouping("sitemap");
        builder.setBolt("indexer", new IndexerBolt(), numWorkers)
                .localOrShuffleGrouping("parse");

        Fields furl = new Fields("url");

        builder.setBolt("status", new StatusUpdaterBolt(), numWorkers)
                .fieldsGrouping("fetch", Constants.StatusStreamName, furl)
                .fieldsGrouping("sitemap", Constants.StatusStreamName, furl)
                .fieldsGrouping("parse", Constants.StatusStreamName, furl)
                .fieldsGrouping("indexer", Constants.StatusStreamName, furl);

        builder.setBolt("deleter", new DeletionBolt(), numWorkers)
                .localOrShuffleGrouping("status", Constants.DELETION_STREAM_NAME);

        conf.registerMetricsConsumer(MetricsConsumer.class);
        conf.registerMetricsConsumer(LoggingMetricsConsumer.class);

        return submit("crawl", conf, builder);
    }
}
The flow -
Run the ESSeedInjector class (this injects the URLs successfully).
Run the crawler class (ESCrawlTopology).
The crawl starts, but at some arbitrary point it throws the following error:
18892 [elasticsearch[_client_][listener][T#2]] ERROR c.d.s.e.p.CollapsingSpout - Exception with ES query
org.elasticsearch.transport.RemoteTransportException: [2rbuRko][127.0.0.1:9300][indices:data/read/search]
Caused by: org.elasticsearch.transport.RemoteTransportException: [2rbuRko][127.0.0.1:9300][indices:data/read/msearch]
Caused by: java.lang.IllegalArgumentException: Validation Failed: 1: no requests added;
at org.elasticsearch.action.ValidateActions.addValidationError(ValidateActions.java:29) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.search.MultiSearchRequest.validate(MultiSearchRequest.java:90) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:131) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:64) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:54) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.transport.TransportService.sendLocalRequest(TransportService.java:621) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.transport.TransportService.access$000(TransportService.java:73) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.transport.TransportService$3.sendRequest(TransportService.java:133) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:569) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:502) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:529) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:520) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.search.SearchTransportService.sendExecuteMultiSearch(SearchTransportService.java:182) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:93) ~[?:?]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:144) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:138) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:207) ~[?:?]
at org.elasticsearch.action.search.FetchSearchPhase.lambda$innerRun$2(FetchSearchPhase.java:105) ~[?:?]
at org.elasticsearch.action.search.FetchSearchPhase.innerRun(FetchSearchPhase.java:117) ~[?:?]
at org.elasticsearch.action.search.FetchSearchPhase.access$000(FetchSearchPhase.java:45) ~[?:?]
at org.elasticsearch.action.search.FetchSearchPhase$1.doRun(FetchSearchPhase.java:87) ~[?:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.3.0.jar:5.3.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
I'm not sure what is causing the error, but the pattern I see is this: if I wipe the data from Elasticsearch by running ESIndex.Init, then run ESSeedInjector followed by ESCrawlTopology, the exception is thrown very early in the crawl (right after the seed URLs have been parsed).
However, if I then run ESCrawlTopology again (without doing anything else), the exception shows up much later.
EDIT: After switching from CollapsingSpout() to AggregationSpout(), I now get this log instead:
15409 [elasticsearch[_client_][listener][T#1]] INFO c.d.s.e.p.AggregationSpout - ES query returned 0 hits from 0 buckets in 2 msec with 0 already being processed
Nothing is being processed or indexed in ES.
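One way to narrow this down is to check whether the seeds actually made it into the status index, and in what state they are, by querying the index directly with the ES 5.x transport client. This is only a diagnostic sketch; the cluster name "elasticsearch", index name "status" and doc type "status" are assumptions based on the usual StormCrawler defaults and may differ from what es-conf.yaml sets up:

import java.net.InetAddress;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

// Diagnostic sketch: dump a few documents from the "status" index to check
// whether the injected URLs are there and what their status/nextFetchDate is.
public class StatusIndexCheck {
    public static void main(String[] args) throws Exception {
        Settings settings = Settings.builder()
                .put("cluster.name", "elasticsearch") // assumption: default cluster name
                .build();
        try (TransportClient client = new PreBuiltTransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress(
                        InetAddress.getByName("127.0.0.1"), 9300))) {
            SearchResponse resp = client.prepareSearch("status") // assumption: default index name
                    .setTypes("status")                          // assumption: default doc type
                    .setSize(10)
                    .get();
            System.out.println("total status docs: " + resp.getHits().getTotalHits());
            for (SearchHit hit : resp.getHits().getHits()) {
                System.out.println(hit.getSourceAsString());
            }
        }
    }
}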