I am using StormCrawler with the external SQL module. I updated my pom.xml with the following:
<dependency>
    <groupId>com.digitalpebble.stormcrawler</groupId>
    <artifactId>storm-crawler-sql</artifactId>
    <version>1.8</version>
</dependency>
I run the injector/crawler in much the same way as with the ES setup:
storm jar target/stromcrawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local sql-injector.flux --sleep 864000
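I created the MySQL database crawl and the table urls, and successfully injected my URLs. For example, if I run select * from crawl.urls limit 5; I can see the URL, status, and other fields. From this I conclude that, at this stage, the crawler does connect to the database.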
sql-injector.flux looks like this:
name: "injector"
includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false
  - resource: false
    file: "crawler-conf.yaml"
    override: true
  - resource: false
    file: "sql-conf.yaml"
    override: true
  - resource: false
    file: "my-config.yaml"
    override: true
components:
  - id: "scheme"
    className: "com.digitalpebble.stormcrawler.util.StringTabScheme"
    constructorArgs:
      - DISCOVERED
spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "seeds.txt"
      - ref: "scheme"
bolts:
  - id: "status"
    className: "com.digitalpebble.stormcrawler.sql.StatusUpdaterBolt"
    parallelism: 1
streams:
  - from: "spout"
    to: "status"
    grouping:
      type: CUSTOM
      customClass:
        className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
        constructorArgs:
          - "byHost"
When I run:
storm jar target/stromcrawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --remote sql-crawler.flux
I get the following exception in the parse bolt:
java.lang.RuntimeException: Exception caught while loading the ParseFilters from parsefilters.json
    at com.digitalpebble.stormcrawler.parse.ParseFilters.fromConf(ParseFilters.java:67)
    at com.digitalpebble.stormcrawler.bolt.JSoupParserBolt.prepare(JSoupParserBolt.java:116)
    at org.apache.storm.daemon.executor$fn__5043$fn__5056.invoke(executor.clj:803)
    at org.apache.storm.util$async_loop$fn__557.invoke(util.clj:482)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Unable to build JSON object from file
    at com.digitalpebble.stormcrawler.parse.ParseFilters.<init>(ParseFilters.java:92)
    at com.digitalpebble.stormcrawler.parse.ParseFilters.fromConf(ParseFilters.java:62)
    ... 5 more
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('}' (code 125)): was expecting double-quote to start field name ...
sql-crawler.flux:
name: "crawler"
includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false
  - resource: false
    file: "crawler-conf.yaml"
    override: true
  - resource: false
    file: "sql-conf.yaml"
    override: true
  - resource: false
    file: "my-config.yaml"
    override: true
spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.sql.SQLSpout"
    parallelism: 100
bolts:
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 1
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 1
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 1
  - id: "status"
    className: "com.digitalpebble.stormcrawler.sql.StatusUpdaterBolt"
    parallelism: 1
streams:
  - from: "spout"
    to: "partitioner"
    grouping:
      type: SHUFFLE
  - from: "partitioner"
    to: "fetcher"
    grouping:
      type: FIELDS
      args: ["key"]
  - from: "fetcher"
    to: "sitemap"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "sitemap"
    to: "parse"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "fetcher"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "sitemap"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "parse"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
It looks like the StringUtils object at ParseFilters.java:60 is blank.
Answer 0 (score: 0)
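Check the content of src/main/resources/parsefilters.json (or whatever value you have set for parsefilters.config.file); judging by the error message, the JSON it contains is invalid. Don't forget to rebuild with mvn clean package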
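
One quick way to confirm that the file is the problem is to run it through a strict JSON parser outside of Storm. The sketch below uses Python's standard json module; the check_json helper and the two sample strings are hypothetical and only illustrate one common cause of an unexpected '}' (a trailing comma before the closing brace), not the actual content of your file. Python's parser may differ from Jackson in its exact messages, so treat it as a sanity check rather than an exact reproduction.

```python
import json


def check_json(text):
    """Return None if text parses as JSON, else a short description of the first error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


# Hypothetical file content (an assumption, not the real parsefilters.json):
# the trailing comma leaves the parser expecting another quoted field name,
# but it finds '}' instead.
broken = '{\n  "com.digitalpebble.stormcrawler.parse.ParseFilters": [],\n}'

# Same hypothetical content with the trailing comma removed.
fixed = '{\n  "com.digitalpebble.stormcrawler.parse.ParseFilters": []\n}'

print("broken:", check_json(broken))
print("fixed:", check_json(fixed))
```

To check the real file, read it and pass its content to the helper, for example check_json(open("src/main/resources/parsefilters.json").read()), adjusting the path if parsefilters.config.file points elsewhere. Running python -m json.tool on the file performs a similar check from the command line.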