Nutch crawl gets stuck at spinwaiting/active. How can I reduce the fetch cycle?

Time: 2013-01-02 05:27:15

Tags: nutch web-crawler spinwait

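I am using Nutch 2.1 to crawl a website. The problem is that the crawler keeps reporting the fetch as spinwaiting/active, and because fetching takes so long, the MySQL connection times out. How can I reduce the number of URLs fetched at a time so that MySQL does not time out? Is there a setting in Nutch that lets me fetch only 100 or 500 URLs, parse them and store them in MySQL, and then fetch the next 100 or 500 URLs?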

Error message:

Unexpected error for http://www.example.com
java.io.IOException: java.sql.BatchUpdateException: The last packet successfully received from the server was 36,928,172 milliseconds ago.  The last packet sent successfully to the server was 36,928,172 milliseconds ago. is longer than the server configured value of 'wait_timeout'. You should consider either expiring and/or testing connection validity before use in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property 'autoReconnect=true' to avoid this problem.
    at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340)
    at org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:65)
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:587)
    at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
    at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.output(FetcherReducer.java:663)
    at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:534)
Caused by: java.sql.BatchUpdateException: The last packet successfully received from the server was 36,928,172 milliseconds ago.  The last packet sent successfully to the server was 36,928,172 milliseconds ago. is longer than the server configured value of 'wait_timeout'. You should consider either expiring and/or testing connection validity before use in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property 'autoReconnect=true' to avoid this problem.
    at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2028)
    at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1451)
    at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328)
    ... 5 more
Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: The last packet successfully received from the server was 36,928,172 milliseconds ago.  The last packet sent successfully to the server was 36,928,172 milliseconds ago. is longer than the server configured value of 'wait_timeout'. You should consider either expiring and/or testing connection validity before use in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property 'autoReconnect=true' to avoid this problem.
    at sun.reflect.GeneratedConstructorAccessor49.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
    at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
    at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1116)
    at com.mysql.jdbc.MysqlIO.send(MysqlIO.java:3364)
    at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1983)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
    at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
    at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2427)
    at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:1980)
    ... 7 more
Caused by: java.net.SocketException: Broken pipe
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at com.mysql.jdbc.MysqlIO.send(MysqlIO.java:3345)
    ... 13 more

1 answer:

Answer 0 (score: 1)

  

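"I am using Nutch 2.1 to crawl a website. The problem is that the crawler keeps reporting the fetch as spinwaiting/active, and because fetching takes so long, the MySQL connection times out. How can I reduce the number of URLs fetched at a time so that MySQL does not time out?"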

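To reduce the number of concurrent fetches, add the following property to your nutch-site.xml and adjust the value as needed. Do not edit nutch-default.xml; copy the property into nutch-site.xml and manage the value from there: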

  <property>
    <name>fetcher.threads.fetch</name>
    <value>20</value>
  </property>

As for the timeout problem, you can add this property to nutch-site.xml, set to however long you think loading needs:

<property>
  <name>http.timeout</name>
  <value>240000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>
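
Both snippets above belong inside the `<configuration>` root element of nutch-site.xml. If you would rather script the change than edit the file by hand, here is a minimal sketch in Python using only the standard library. The helper name `set_property` is my own and not part of Nutch. It works on the file's text, so reading and writing nutch-site.xml is left to you.

```python
import xml.etree.ElementTree as ET


def set_property(config_xml: str, name: str, value: str) -> str:
    """Return config_xml with property `name` set to `value`.

    config_xml is the text of a Hadoop-style configuration file such as
    nutch-site.xml, whose root element is <configuration>.
    `set_property` is a hypothetical helper, not a Nutch API.
    """
    root = ET.fromstring(config_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            # Property already present: overwrite its value in place.
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # Property not found: append a new <property> block.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    site = "<configuration></configuration>"
    site = set_property(site, "fetcher.threads.fetch", "20")
    site = set_property(site, "http.timeout", "240000")
    print(site)
```

Calling it a second time with the same name replaces the value instead of adding a duplicate `<property>` entry.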
  

"Is there a setting in Nutch that lets me fetch only 100 or 500 URLs, parse them and store them in MySQL, and then fetch the next 100 or 500 URLs?"

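Nutch crawls in a loop of steps, generate/fetch/parse/update, repeated for the number of iterations (called "depth") that you specify in your crawl command. If you want more control over the crawl, you can run each step yourself as described in section 3.2 (Using Individual Commands for Whole-Web Crawling) of the tutorial at http://wiki.apache.org/nutch/NutchTutorial. That gives you good guidance and shows exactly what is happening. Check the status as each segment is fetched so you know how many URLs were fetched in each segment.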
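
To illustrate running the steps one at a time in small batches, here is a rough sketch in Python. It assumes the Nutch 2.x command forms as I remember them from the tutorial (`generate -topN N`, `fetch -all`, `parse -all`, `updatedb`) and that `bin/nutch` is reachable from the working directory. Both are assumptions, so check the usage output of `bin/nutch` for your version before relying on it. The function names are my own.

```python
import subprocess
from typing import List


def cycle_commands(top_n: int, nutch: str = "bin/nutch") -> List[List[str]]:
    """Build the commands for one generate/fetch/parse/update round.

    The sub-commands and flags are assumed from the Nutch 2.x tutorial;
    verify them against your installation.
    """
    return [
        [nutch, "generate", "-topN", str(top_n)],  # select at most top_n URLs
        [nutch, "fetch", "-all"],                  # fetch the generated batch
        [nutch, "parse", "-all"],                  # parse what was fetched
        [nutch, "updatedb"],                       # write results back to the store
    ]


def crawl(rounds: int, top_n: int, dry_run: bool = True) -> List[List[str]]:
    """Run (or, with dry_run=True, only list) `rounds` crawl rounds."""
    executed = []
    for _ in range(rounds):
        for cmd in cycle_commands(top_n):
            if not dry_run:
                # check=True stops the loop as soon as one step fails.
                subprocess.run(cmd, check=True)
            executed.append(cmd)
    return executed


if __name__ == "__main__":
    # Dry run: print the commands for three rounds of 500 URLs each.
    for cmd in crawl(rounds=3, top_n=500):
        print(" ".join(cmd))
```

With `-topN 500` each round handles at most 500 URLs before the results are written to the database, so the time between writes should stay well below what a single large fetch would take.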