I am running into a problem when running a big-data job (~15 GB) on a Hadoop cluster using the SimString native library. The same job runs fine on small/medium datasets (~200 MB). During the job, SimString first builds a file-based database of the strings to match against, then matches each given String against the strings in that database. After the job completes, it deletes the file-based database. The job runs multithreaded (100 threads).
About 22 mappers are created for the job, and each mapper runs 100 threads. The machine has 4 GB of RAM in total.
The error log is as follows:
14/02/12 00:15:53 INFO mapred.JobClient: map 0% reduce 0%
14/02/12 00:16:13 INFO mapred.JobClient: map 4% reduce 0%
14/02/12 00:16:24 INFO mapred.JobClient: Task Id : attempt_201402091522_0059_m_000001_0, Status : FAILED
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 134.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
attempt_201402091522_0059_m_000001_0: #
attempt_201402091522_0059_m_000001_0: # A fatal error has been detected by the Java Runtime Environment:
attempt_201402091522_0059_m_000001_0: #
attempt_201402091522_0059_m_000001_0: # SIGSEGV (0xb) at pc=0x00007f6f1cd8827b, pid=21146, tid=140115055609600
attempt_201402091522_0059_m_000001_0: #
attempt_201402091522_0059_m_000001_0: # JRE version: 6.0_45-b06
attempt_201402091522_0059_m_000001_0: # Java VM: Java HotSpot(TM) 64-Bit Server VM (20.45-b01 mixed mode linux-amd64 compressed oops)
attempt_201402091522_0059_m_000001_0: # Problematic frame:
attempt_201402091522_0059_m_000001_0: # C [libSimString.so+0x6c27b][thread 140115045103360 also had an error]
attempt_201402091522_0059_m_000001_0: cdbpp::cdbpp_base<cdbpp::murmurhash2>::get(void const*, unsigned long, unsigned long*) const+0x16f
attempt_201402091522_0059_m_000001_0: #
attempt_201402091522_0059_m_000001_0: # An error report file with more information is saved as:
attempt_201402091522_0059_m_000001_0: # /app/hadoop/tmp/mapred/local/taskTracker/hduser/jobcache/job_201402091522_0059/attempt_201402091522_0059_m_000001_0/work/hs_err_pid21146.log
attempt_201402091522_0059_m_000001_0: [thread 140115070318336 also had an error]
attempt_201402091522_0059_m_000001_0: [thread 140114919028480 also had an error]
attempt_201402091522_0059_m_000001_0: [thread 140115089229568 also had an error]
attempt_201402091522_0059_m_000001_0: #
attempt_201402091522_0059_m_000001_0: # If you would like to submit a bug report, please visit:
attempt_201402091522_0059_m_000001_0: # http://java.sun.com/webapps/bugreport/crash.jsp
attempt_201402091522_0059_m_000001_0: # The crash happened outside the Java Virtual Machine in native code.
attempt_201402091522_0059_m_000001_0: # See problematic frame for where to report the bug.
The problem appears to originate in native code, in this frame:
cdbpp::cdbpp_base<cdbpp::murmurhash2>::get(void const*, unsigned long, unsigned long*) const+0x16f
However, I don't understand why this causes no problems on the small dataset. I am running the job with the following hadoop command:
hadoop jar hadoopjobs/job.jar Job -D mapred.child.java.opts=-Xss500k -D mapred.reduce.child.java.opts=-Xmx200m -files file1,file2,/home/hduser/libs/libSim/x64/libSimString.so -libjars /home/hduser/libs/Simstring.jar /datasources/XXX/spool/input datasources/XXX/spool/output
References:
SimString library: http://www.chokkan.org/software/simstring/
cdbpp source for `cdbpp_base::get(void const*, unsigned long, unsigned long*) const+0x16f`: https://gitorious.org/copy-paste/copy-paste/commit/5d9c6b5b29fb2b1b8dd571260e7d50d9c42db9f9
Answer 0 (score: 0)
The problem is probably not in the murmur hash itself, but in the native library and how it allocates memory.
I have no experience with JNI calls, but they are known to be problematic memory-wise (each such call allocates both stack and heap space), and there is no guarantee GC fires correctly (read the horror stories about GZipInputStream).
You say you create 22 × 100 threads, each of which may allocate some stack for its JNI calls, and the box only has 4 GB of RAM. The machine sounds heavily overcommitted; I would guess the constraint is CPU/memory access rather than long external waits (i.e. only a few of those threads are ever truly active in parallel)?
What happens when you drastically reduce the thread count? How is the SimString library meant to be used? Does it have an internal threading model that must be respected (e.g. only one thread may query at a time)?
I suspect the JNI side is single-threaded.
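The "drastically reduce the thread count" suggestion can be sketched with a fixed-size thread pool instead of 100 unbounded threads per mapper. This is a hypothetical sketch: `matchOne` stands in for whatever per-record SimString lookup the job performs (its real name and signature are not shown in the question), and the pool size of 4 is an arbitrary starting point to tune against the 4 GB box.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BoundedMatcher {

    // Run `tasks` lookups on a pool of `poolSize` threads and return
    // the number of completed lookups. Bounding the pool caps the total
    // per-thread stack space the mapper can consume.
    static int runAll(int tasks, int poolSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        List<Future<Integer>> results = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            final int id = i;
            results.add(pool.submit(() -> matchOne(id)));
        }
        int total = 0;
        for (Future<Integer> f : results) {
            total += f.get(); // propagate any worker exception
        }
        pool.shutdown();
        return total;
    }

    // Placeholder: a real implementation would call the native
    // SimString matcher here instead of returning a constant.
    static int matchOne(int id) {
        return 1;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runAll(100, 4)); // prints 100
    }
}
```

With 22 mappers × 100 threads and `-Xss500k`, thread stacks alone can claim on the order of a gigabyte; a small fixed pool keeps that bounded while still overlapping I/O.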
Answer 1 (score: 0)
As I mentioned above, the problem lies in calling the following method from Java:
cdbpp::cdbpp_base<cdbpp::murmurhash2>::get(void const*, unsigned long, unsigned long*) const+0x16f
I use 100 threads per mapper, with 22 mappers in total, 2 of which run in parallel. A static reader was invoking the method above without synchronization, which caused the problem. Wrapping this method call in a synchronized block solved the issue.
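The fix above can be sketched as follows. This is a simplified stand-in, not the SimString API: `retrieve` and the `StringBuilder` merely simulate a shared, non-thread-safe reader backed by the native library, and `DB_LOCK` is the hypothetical shared monitor that serializes all calls into it.

```java
public class SyncedReader {

    // Shared lock guarding the non-thread-safe lookup.
    private static final Object DB_LOCK = new Object();

    // Simulates the mutable state inside the shared static reader;
    // in the real job this would be the native SimString database handle.
    private static final StringBuilder shared = new StringBuilder();

    // Every thread funnels through the same lock, so the underlying
    // (simulated) native lookup never runs concurrently.
    static String retrieve(String query) {
        synchronized (DB_LOCK) {
            shared.setLength(0);
            shared.append("hit:").append(query);
            return shared.toString();
        }
    }

    public static void main(String[] args) throws Exception {
        final String[] out = new String[8];
        Thread[] ts = new Thread[8];
        for (int i = 0; i < 8; i++) {
            final int idx = i;
            ts[i] = new Thread(() -> out[idx] = retrieve("q" + idx));
            ts[i].start();
        }
        for (Thread t : ts) {
            t.join();
        }
        System.out.println(out[3]); // prints hit:q3
    }
}
```

The trade-off is that a single lock serializes all queries, so the 100 threads per mapper no longer buy any lookup parallelism; it trades throughput for correctness when the native side is not thread-safe.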