I want to crawl data from a website, so I am using the openweather.org API. The Flume agent I configured to stream the data is as follows:
weather.channels= memory-channel
weather.channels.memory-channel.capacity=10000
weather.channels.memory-channel.type = memory
weather.sinks = hdfs-write
weather.sinks.hdfs-write.channel=memory-channel
weather.sinks.hdfs-write.type = logger
weather.sinks.hdfs-write.hdfs.path = hdfs://localhost:8020/user/hadoop/flume/
weather.sinks.hdfs-write.rollInterval = 1200
weather.sinks.hdfs-write.hdfs.writeFormat=Text
weather.sinks.hdfs-write.hdfs.fileType=DataStream
weather.sources= Weather
weather.sources.Weather.bind = api.openweathermap.org/data/2.5/forecast/city?id=285787&APPID=8ce9bbbe446da25b19242763bdddb90a
weather.sources.Weather.username= abc
weather.sources.Weather.password= ********
weather.sources.Weather.channels=memory-channel
weather.sources.Weather.type = http
weather.sources.Weather.port = 11111
When I run the Flume agent with the following command:
flume-ng agent -f weather.conf -n weather
I get the following error:
15/03/23 05:17:34 INFO node.PollingPropertiesFileConfigurationProvider: Reloading configuration file:weather.conf
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Added sinks: hdfs-write Agent: weather
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Post-validation flume configuration contains configuration for agents: [weather]
15/03/23 05:17:34 INFO node.AbstractConfigurationProvider: Creating channels
15/03/23 05:17:34 INFO channel.DefaultChannelFactory: Creating instance of channel memory-channel type memory
15/03/23 05:17:34 INFO node.AbstractConfigurationProvider: Created channel memory-channel
15/03/23 05:17:34 INFO source.DefaultSourceFactory: Creating instance of sourceWeather, type http
15/03/23 05:17:35 INFO sink.DefaultSinkFactory: Creating instance of sink: hdfs-write, type: logger
15/03/23 05:17:35 INFO node.AbstractConfigurationProvider: Channel memory-channel connected to [Weather, hdfs-write]
15/03/23 05:17:35 INFO node.Application: Starting new configuration:{
sourceRunners:{Weather=EventDrivenSourceRunner: {
source:org.apache.flume.source.http.HTTP
Source{name:Weather,state:IDLE} }} sinkRunners:{hdfs-write=SinkRunner: {
policy:org.apache.flume.sink.DefaultSinkProcessor@529d1dd7 counterGroup:{
name:null counters:{} } }} channels:{memory-
channel=org.apache.flume.channel.MemoryChannel{name: memory-channel}} }
15/03/23 05:17:35 INFO node.Application: Starting Channel memory-channel
15/03/23 05:17:35 INFO instrumentation.MonitoredCounterGroup: Monitored
countergroup for type: CHANNEL, name: memory-channel: Successfully
registered new MBean.
15/03/23 05:17:35 INFO instrumentation.MonitoredCounterGroup: Component
type: CHANNEL, name: memory-channel started
15/03/23 05:17:35 INFO node.Application: Starting Sink hdfs-write
15/03/23 05:17:35 INFO node.Application: Starting Source Weather
15/03/23 05:17:35 INFO mortbay.log: Logging to
org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
org.mortbay.log.Slf4jLog
15/3/23 05:17:35 INFO mortbay.log: jetty-6.1.26
15/03/23 05:17:36 WARN mortbay.log: failed
SelectChannelConnector@api.openweathermap.org/data/2.5/forecast/city?
id=285787&APPID=8ce9bbbe446da25b19242763bdddb90a:11111:
java.net.SocketException: Unresolved address
15/03/23 05:17:36 WARN mortbay.log: failed Server@642c189d:
java.net.SocketException: Unresolved address
15/03/23 05:17:36 ERROR http.HTTPSource: Error while starting HTTPSource.
Exception follows.java.net.SocketException: Unresolved address
at sun.nio.ch.Net.translateToSocketException(Net.java:157)
at sun.nio.ch.Net.translateException(Net.java:183)
at sun.nio.ch.Net.translateException(Net.java:189)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
at org.mortbay.jetty.nio.SelectChannelConnector.open
(SelectChannelConnector.java:216)
at org.mortbay.jetty.nio.SelectChannelConnector.doStart(SelectChannelCon
nector.java:315)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java_
at org.mortbay.jetty.Server.doStart(Server.java:235)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java)
at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:220)
at org.apache.flume.source.EventDrivenSourceRunner.start(EventDrivenSour
ceRunner.java:44)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run
(LifecycleSupervisor.java:251)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
access$301(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:127)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
... 15 more
15/03/23 05:17:36 ERROR lifecycle.LifecycleSupervisor: Unable to start
EventDrivenSourceRunner: {
source:org.apache.flume.source.http.HTTPSource{name:Weather,state:IDLE} }
- Exception follows.
java.lang.RuntimeException: java.net.SocketException: Unresolved address
at com.google.common.base.Throwables.propagate(Throwables.java:156)
at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:224)
at org.apache.flume.source.EventDrivenSourceRunner.start
(EventDrivenSourceRunner.java:44)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(Li
fecycleSupervisor.java:251)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
access$301(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketException: Unresolved address
at sun.nio.ch.Net.translateToSocketException(Net.java:157)
at sun.nio.ch.Net.translateException(Net.java:183)
at sun.nio.ch.Net.translateException(Net.java:189)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnec
tor.java:216)
at org.mortbay.jetty.nio.SelectChannelConnector.doStart(SelectChannelCon
nector.java:315)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:
at org.mortbay.jetty.Server.doStart(Server.java:235)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:
at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:220)
... 9 more
Caused by: java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:127)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
... 15 more
15/03/23 05:17:39 ERROR lifecycle.LifecycleSupervisor: Unable to start
EventDrivenSourceRunner: {
source:org.apache.flume.source.http.HTTPSource{name:Weather,state:IDLE}
} - Exception follows.
java.lang.IllegalStateException: Running HTTP Server found in source:
Weather before I started one.Will not attempt to start.
at com.google.common.base.Preconditions.checkState(Preconditions.java:14
at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:189)
at org.apache.flume.source.EventDrivenSourceRunner.start(EventDrivenSour
ceRunner.java:44)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(Li
fecycleSupervisor.java:251)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
access$301(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:615)
at java.lang.Thread.run(Thread.java:745)
^C15/03/23 05:17:41 INFO lifecycle.LifecycleSupervisor: Stopping
lifecycle supervisor 10
15/03/23 05:17:41 INFO node.PollingPropertiesFileConfigurationProvider:
Configuration provider stopping
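Could you please help me solve this problem?
Or is there something else I have to do before configuring the Flume agent?
Or should I use Nutch to crawl the data, or should I use Storm?
Please tell me which option is best.
Thanks in advance.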
Answer 0 (score: 1)
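The bind parameter of HTTPSource specifies the IP address or hostname on which the agent listens for incoming data. It is not the endpoint to crawl; it is the endpoint (together with the port) to which a crawler must send its data. That is why the source fails with "Unresolved address": the agent tries to open a listening socket on the OpenWeatherMap URL, which is not a local address.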
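To illustrate the direction of the data flow, here is a minimal Python sketch of a client pushing events to an HTTPSource. It assumes the source uses Flume's default JSON handler, which accepts a JSON array of events with "headers" and "body" fields, and that the agent listens on localhost:11111 (the port from the question, with bind set to a local address). The function names are hypothetical.

```python
import json
import urllib.request


def build_payload(bodies, headers=None):
    """Encode event bodies as the JSON array expected by Flume's default JSON handler."""
    events = [{"headers": dict(headers or {}), "body": body} for body in bodies]
    return json.dumps(events).encode("utf-8")


def post_events(url, bodies, headers=None, timeout=10):
    """POST the events to the HTTPSource and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=build_payload(bodies, headers),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Example call (requires a running agent; the address is an assumption):
# post_events("http://localhost:11111", ["first event", "second event"])
```

In this arrangement the crawler is a separate process that fetches the weather data and then calls something like post_events; Flume itself never contacts api.openweathermap.org.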
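That said, I would recommend using an Exec source to run a script that crawls openweather.org and writes the data to its standard output; that output then serves as the agent's input data.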
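A minimal sketch of such a script in Python follows. It assumes the forecast endpoint from the question is reachable over plain http, that the API key is supplied through an environment variable named OWM_APPID (a hypothetical name, chosen to avoid hard-coding the key), and that one compact JSON line per run is an acceptable event format.

```python
import json
import os
import sys
import urllib.parse
import urllib.request

# Endpoint taken from the question; the http scheme is an assumption.
BASE_URL = "http://api.openweathermap.org/data/2.5/forecast/city"


def build_url(city_id, appid):
    """Build the forecast URL for one city."""
    query = urllib.parse.urlencode({"id": city_id, "APPID": appid})
    return BASE_URL + "?" + query


def to_event_line(payload):
    """Serialize the response as single-line JSON (the Exec source reads stdout line by line)."""
    return json.dumps(payload, separators=(",", ":"))


def fetch(url, timeout=30):
    """Download and decode the JSON response."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    appid = os.environ.get("OWM_APPID")
    if appid:
        # 285787 is the city id used in the question.
        print(to_event_line(fetch(build_url("285787", appid))))
        sys.stdout.flush()
    else:
        sys.stderr.write("OWM_APPID is not set; nothing fetched\n")
```

On the Flume side, the source would be declared with type = exec and a command property that runs the script (for example python followed by the script's path, which is a placeholder here). Because this sketch exits after one fetch, the Exec source's restart and restartThrottle properties could be used to rerun it periodically; alternatively the script could loop with a sleep.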