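I am using Nutch 1.15 with a Hadoop 2.7.2 cluster (Standard A4m v2: 4 vCPUs, 32 GB memory) consisting of 1 master/slave node and 3 slave nodes. I am trying to crawl 9 URLs and have configured mapreduce.job.reduces = 10, topN = 9000 (1000 URLs/host), depth = 4, generate.max.count = 1000 and mapreduce.map/reduce.java.opts = -Xmx4g.

As I understand it, the generate step should split the crawl into 10 parts, one per host, and each part should contain at most 9000 / 10 = 900 URLs. What I actually see is that a few parts have no entries and a few parts contain URLs from multiple hosts. Is my understanding correct?

The crawl with the above configuration takes 35 minutes to complete.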
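To make the part assignment concrete, here is a minimal sketch of what I suspect may be happening. It assumes the generator assigns each URL to a part by hashing its host name and taking the result modulo the number of reduce tasks; that is my assumption, not something I have confirmed in the Nutch source. The host names are hypothetical placeholders, not my real seed list.

```python
from collections import defaultdict
from urllib.parse import urlparse


def java_string_hash(s: str) -> int:
    """Mimic Java's String.hashCode() as a signed 32-bit integer (ASCII/BMP input)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def partition_for(url: str, num_reduces: int) -> int:
    """Assumed scheme: non-negative host hash modulo the number of reduce tasks."""
    host = urlparse(url).hostname or url
    return (java_string_hash(host) & 0x7FFFFFFF) % num_reduces


def group_by_partition(urls, num_reduces):
    """Return {partition index: set of hosts that landed in it}."""
    parts = defaultdict(set)
    for url in urls:
        parts[partition_for(url, num_reduces)].add(urlparse(url).hostname)
    return dict(parts)


# Hypothetical seed URLs: 9 distinct hosts, as in my crawl.
SEED_URLS = [f"http://site{i}.example.com/" for i in range(1, 10)]

if __name__ == "__main__":
    parts = group_by_partition(SEED_URLS, 10)
    for p in range(10):
        print(p, sorted(parts.get(p, [])))
```

With 9 hosts and 10 parts, at least one part is necessarily empty, and any hash collision puts two hosts into the same part. If the assumption holds, that would match the empty and mixed-host parts I am seeing.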
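When I run the same 9 URLs in local mode without Hadoop/HDFS (4 instances of local mode), the crawl finishes in 23 minutes. The configuration for local mode is mapreduce.job.reduces = N/A, topN = 1000 (since one URL/host is selected at a time), depth = 4, generate.max.count = 1000.

Below is the log summary, where I can see that Map output records is always half of Spilled Records: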
Map input records=814
Map output records=2025
Map output bytes=96541492
Map output materialized bytes=45547532
Input split bytes=648
Combine input records=0
Combine output records=0
Reduce input groups=981
Reduce shuffle bytes=45547532
Reduce input records=2025
Reduce output records=2025
Spilled Records=4050
Shuffled Maps =8
Failed Shuffles=0
Merged Map outputs=8
GC time elapsed (ms)=5260
CPU time spent (ms)=143320
Physical memory (bytes) snapshot=7797211136
Virtual memory (bytes) snapshot=38557540352
Total committed heap usage (bytes)=8712617984
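The ratios I am referring to can be checked with a few lines of arithmetic. This is only a sketch over the counter values pasted above; the dictionary keys and the `summarize` helper are names I made up for illustration.

```python
# Counter values copied from the job log above.
counters = {
    "Map output records": 2025,
    "Map output bytes": 96541492,
    "Map output materialized bytes": 45547532,
    "Spilled Records": 4050,
}


def summarize(c):
    """Derive a few ratios from the raw counters (hypothetical helper)."""
    return {
        # Spilled records per map output record.
        "spill_ratio": c["Spilled Records"] / c["Map output records"],
        # Bytes after map output compression relative to bytes before.
        "compression_ratio": c["Map output materialized bytes"] / c["Map output bytes"],
        # Average uncompressed size of one map output record, in bytes.
        "avg_record_bytes": c["Map output bytes"] / c["Map output records"],
    }


if __name__ == "__main__":
    for name, value in summarize(counters).items():
        print(f"{name}: {value:.2f}")
```

The spill ratio comes out at exactly 2. That would be consistent with each record being written to disk once on the map side and once on the reduce side, but this is my interpretation and I have not verified it.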
And below is my mapred-site.xml configuration:
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>5012</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>5012</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx4g -XX:+UseCompressedOops</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx4g -XX:+UseCompressedOops</value>
</property>
<property>
  <name>mapreduce.job.maps</name>
  <value>2</value>
  <description>
    define mapred.map tasks to be number of slave hosts
  </description>
</property>
<property>
  <name>mapreduce.job.reduces</name>
  <value>4</value>
  <description>
    define mapred.reduce tasks to be number of slave hosts
  </description>
</property>
<property>
  <name>mapreduce.reduce.shuffle.parallelcopies</name>
  <value>10</value>
</property>
<property>
  <name>mapreduce.task.io.sort.factor</name>
  <value>100</value>
</property>
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>1048</value>
</property>
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
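As a sanity check on the memory settings above, here is a small sketch that compares the JVM heap with the container size and the sort buffer. The figure for how many containers fit on a node assumes the full 32 GB of each node is available to YARN, which is an assumption: my yarn-site.xml is not shown here. The helper names are made up for illustration.

```python
import re


def xmx_to_mb(java_opts: str) -> int:
    """Extract the -Xmx value from a JVM options string and return it in MB."""
    m = re.search(r"-Xmx(\d+)([gGmMkK]?)", java_opts)
    if not m:
        raise ValueError("no -Xmx setting found")
    value, unit = int(m.group(1)), m.group(2).lower()
    if unit == "g":
        return value * 1024
    if unit == "m":
        return value
    if unit == "k":
        return value // 1024
    return value // (1024 * 1024)  # no suffix means bytes


# Values copied from the mapred-site.xml above.
CONTAINER_MB = 5012  # mapreduce.map.memory.mb / mapreduce.reduce.memory.mb
JAVA_OPTS = "-Xmx4g -XX:+UseCompressedOops"
SORT_MB = 1048       # mapreduce.task.io.sort.mb

# ASSUMPTION: all 32 GB per node is handed to YARN; the real limit is not shown.
NODE_MB = 32 * 1024


def memory_report():
    heap_mb = xmx_to_mb(JAVA_OPTS)
    return {
        "heap_mb": heap_mb,
        "heap_fraction_of_container": heap_mb / CONTAINER_MB,
        "sort_buffer_fraction_of_heap": SORT_MB / heap_mb,
        "max_containers_per_node": NODE_MB // CONTAINER_MB,
    }


if __name__ == "__main__":
    for name, value in memory_report().items():
        print(name, value)
```

Under that assumption, each node could host at most 6 containers of 5012 MB. The 4096 MB heap takes roughly 82% of the container, and the sort buffer takes roughly a quarter of the heap.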
What am I doing wrong, or what am I missing?