Crawling data from a website into HDFS

Date: 2015-03-23 06:35:28

Tags: web-crawler hdfs nutch apache-storm flume

I want to crawl data from a website, so I am using the API from openweather.org. I have configured the Flume agent that should stream the data as follows:

weather.channels= memory-channel
weather.channels.memory-channel.capacity=10000
weather.channels.memory-channel.type = memory
weather.sinks = hdfs-write
weather.sinks.hdfs-write.channel=memory-channel
weather.sinks.hdfs-write.type = logger
weather.sinks.hdfs-write.hdfs.path = hdfs://localhost:8020/user/hadoop/flume/
weather.sinks.hdfs-write.rollInterval = 1200
weather.sinks.hdfs-write.hdfs.writeFormat=Text
weather.sinks.hdfs-write.hdfs.fileType=DataStream
weather.sources= Weather
weather.sources.Weather.bind =     api.openweathermap.org/data/2.5/forecast/city?id=285787&APPID=8ce9bbbe446da25b19242763bdddb90a
weather.sources.Weather.username= abc
weather.sources.Weather.password= ********
weather.sources.Weather.channels=memory-channel
weather.sources.Weather.type = http
weather.sources.Weather.port = 11111

When I run the Flume agent with the following command:

flume-ng agent -f weather.conf -n weather

I get the following error:

15/03/23 05:17:34 INFO node.PollingPropertiesFileConfigurationProvider: Reloading configuration file:weather.conf
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Added sinks: hdfs-write Agent: weather
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Post-validation flume configuration contains configuration for agents: [weather]
15/03/23 05:17:34 INFO node.AbstractConfigurationProvider: Creating channels
15/03/23 05:17:34 INFO channel.DefaultChannelFactory: Creating instance of channel memory-channel type memory
15/03/23 05:17:34 INFO node.AbstractConfigurationProvider: Created channel memory-channel
15/03/23 05:17:34 INFO source.DefaultSourceFactory: Creating instance of sourceWeather, type http
15/03/23 05:17:35 INFO sink.DefaultSinkFactory: Creating instance of sink: hdfs-write, type: logger
15/03/23 05:17:35 INFO node.AbstractConfigurationProvider: Channel memory-channel connected to [Weather, hdfs-write]
15/03/23 05:17:35 INFO node.Application: Starting new configuration:{     
sourceRunners:{Weather=EventDrivenSourceRunner: {    
source:org.apache.flume.source.http.HTTP
Source{name:Weather,state:IDLE} }} sinkRunners:{hdfs-write=SinkRunner: {   
policy:org.apache.flume.sink.DefaultSinkProcessor@529d1dd7 counterGroup:{    
name:null counters:{} } }} channels:{memory-   
channel=org.apache.flume.channel.MemoryChannel{name: memory-channel}} }
15/03/23 05:17:35 INFO node.Application: Starting Channel memory-channel
15/03/23 05:17:35 INFO instrumentation.MonitoredCounterGroup: Monitored  
countergroup for type: CHANNEL, name: memory-channel: Successfully  
registered new MBean.
15/03/23 05:17:35 INFO instrumentation.MonitoredCounterGroup: Component   
type: CHANNEL, name: memory-channel started
15/03/23 05:17:35 INFO node.Application: Starting Sink hdfs-write
15/03/23 05:17:35 INFO node.Application: Starting Source Weather
15/03/23 05:17:35 INFO mortbay.log: Logging to 
org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via   
org.mortbay.log.Slf4jLog
15/3/23 05:17:35 INFO mortbay.log: jetty-6.1.26
15/03/23 05:17:36 WARN mortbay.log: failed 
SelectChannelConnector@api.openweathermap.org/data/2.5/forecast/city?
id=285787&APPID=8ce9bbbe446da25b19242763bdddb90a:11111:   
java.net.SocketException: Unresolved address
15/03/23 05:17:36 WARN mortbay.log: failed Server@642c189d: 
java.net.SocketException: Unresolved address
15/03/23 05:17:36 ERROR http.HTTPSource: Error while starting HTTPSource.    
  Exception follows.java.net.SocketException: Unresolved address
    at sun.nio.ch.Net.translateToSocketException(Net.java:157)
    at sun.nio.ch.Net.translateException(Net.java:183)
    at sun.nio.ch.Net.translateException(Net.java:189)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
    at org.mortbay.jetty.nio.SelectChannelConnector.open
    (SelectChannelConnector.java:216)
    at org.mortbay.jetty.nio.SelectChannelConnector.doStart(SelectChannelCon
    nector.java:315)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java_
    at org.mortbay.jetty.Server.doStart(Server.java:235)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java)
    at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:220)
    at org.apache.flume.source.EventDrivenSourceRunner.start(EventDrivenSour
    ceRunner.java:44)
    at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run
    (LifecycleSupervisor.java:251)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
    access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
    run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
    java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
    .java:615)
    at java.lang.Thread.run(Thread.java:745)
    Caused by: java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:127)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    ... 15 more
   15/03/23 05:17:36 ERROR lifecycle.LifecycleSupervisor: Unable to start 
   EventDrivenSourceRunner: {   
   source:org.apache.flume.source.http.HTTPSource{name:Weather,state:IDLE} } 
   - Exception follows.
   java.lang.RuntimeException: java.net.SocketException: Unresolved address
    at com.google.common.base.Throwables.propagate(Throwables.java:156)
    at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:224)
    at org.apache.flume.source.EventDrivenSourceRunner.start
    (EventDrivenSourceRunner.java:44)
    at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(Li
    fecycleSupervisor.java:251)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
    access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
    run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
    java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
    .java:615)
    at java.lang.Thread.run(Thread.java:745)
    Caused by: java.net.SocketException: Unresolved address
    at sun.nio.ch.Net.translateToSocketException(Net.java:157)
    at sun.nio.ch.Net.translateException(Net.java:183)
    at sun.nio.ch.Net.translateException(Net.java:189)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
    at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnec
    tor.java:216)
    at org.mortbay.jetty.nio.SelectChannelConnector.doStart(SelectChannelCon
    nector.java:315)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:
    at org.mortbay.jetty.Server.doStart(Server.java:235)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:
    at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:220)
    ... 9 more
    Caused by: java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:127)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    ... 15 more
    15/03/23 05:17:39 ERROR lifecycle.LifecycleSupervisor: Unable to start 
    EventDrivenSourceRunner: {   
    source:org.apache.flume.source.http.HTTPSource{name:Weather,state:IDLE} 
    } - Exception follows.
    java.lang.IllegalStateException: Running HTTP Server found in source:  
    Weather before I started one.Will not attempt to start.
    at com.google.common.base.Preconditions.checkState(Preconditions.java:14
    at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:189)
    at org.apache.flume.source.EventDrivenSourceRunner.start(EventDrivenSour
    ceRunner.java:44)
    at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(Li
    fecycleSupervisor.java:251)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
    access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
    run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
    java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
    .java:615)
    at java.lang.Thread.run(Thread.java:745)
    ^C15/03/23 05:17:41 INFO lifecycle.LifecycleSupervisor: Stopping  
    lifecycle supervisor 10
    15/03/23 05:17:41 INFO node.PollingPropertiesFileConfigurationProvider:  
    Configuration provider stopping

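Can you please help me solve this problem?

Or is there something else I have to do before configuring the Flume agent?

Or should I use Nutch to crawl the data, or should I use Storm?

Please help me decide which option is best.

Thanks in advance.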

1 answer:

Answer 0 (score: 1)

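The bind parameter of HTTPSource specifies the IP address or hostname on which the agent will listen for data. It is not the endpoint to crawl; it is the endpoint (together with the port) to which a crawler must send its data. That is why the agent fails with "Unresolved address": it tries to open a listening socket on the OpenWeatherMap URL, which is not a local address.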
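
To illustrate the point about bind, here is a minimal sketch of an HTTP source that listens on a local address. The value 0.0.0.0 (all local interfaces) is an assumption chosen for illustration; the agent, source, channel, and port values are taken from the question.

```properties
# bind must be an address of the machine running the agent, not a remote URL.
weather.sources = Weather
weather.sources.Weather.type = http
# Assumption: listen on all local interfaces; a specific local IP or hostname also works.
weather.sources.Weather.bind = 0.0.0.0
weather.sources.Weather.port = 11111
weather.sources.Weather.channels = memory-channel
```

With a configuration along these lines the agent only waits for incoming HTTP POST requests on port 11111. A separate program would still have to fetch the weather data and post it there.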

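That said, I would suggest using an Exec source to run a script that crawls openweather.org and writes the data to its output; that output is then used as the input data for the agent.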
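
The Exec source suggestion could look roughly like the sketch below. It is untested and rests on several assumptions: the polling loop built on curl, the 600-second interval, and the YOUR_APP_ID placeholder are all hypothetical choices made for illustration. The sketch also declares the sink with type hdfs, on the assumption that writing to HDFS was the intent; the question declares it as logger, which only logs events. Agent, channel, and sink names and the HDFS path come from the question.

```properties
weather.sources = Weather
weather.channels = memory-channel
weather.sinks = hdfs-write

# Exec source: runs a command and turns each line of its standard output into an event.
weather.sources.Weather.type = exec
weather.sources.Weather.shell = /bin/bash -c
# Assumption: poll the API every 600 seconds; "echo" ends each response with a newline.
# YOUR_APP_ID is a placeholder for your own API key.
weather.sources.Weather.command = while true; do curl -s 'http://api.openweathermap.org/data/2.5/forecast/city?id=285787&APPID=YOUR_APP_ID'; echo; sleep 600; done
weather.sources.Weather.channels = memory-channel

weather.channels.memory-channel.type = memory
weather.channels.memory-channel.capacity = 10000

# Assumption: type hdfs instead of logger, so that events are written to HDFS.
weather.sinks.hdfs-write.type = hdfs
weather.sinks.hdfs-write.channel = memory-channel
weather.sinks.hdfs-write.hdfs.path = hdfs://localhost:8020/user/hadoop/flume/
weather.sinks.hdfs-write.hdfs.rollInterval = 1200
weather.sinks.hdfs-write.hdfs.writeFormat = Text
weather.sinks.hdfs-write.hdfs.fileType = DataStream
```

Because the Exec source creates one event per output line, this only yields one event per API call if the response arrives as a single line of JSON. If the API returns pretty-printed output, the script would need to compact it first.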