I am running into a problem when running a big-data job (~15 GB) on a Hadoop cluster using the SimString native library. The same job runs fine on small/medium datasets (~200 MB). During the job, SimString first builds a file-based database of the strings to match against, then matches each given String against the strings in that database. After the job completes, it deletes the file-based database. The job runs multithreaded (100 threads).
About 22 mappers are created for the job, and each mapper runs 100 threads. The machine has 4 GB of RAM in total.
The error log is as follows:
14/02/12 00:15:53 INFO mapred.JobClient: map 0% reduce 0%
14/02/12 00:16:13 INFO mapred.JobClient: map 4% reduce 0%
14/02/12 00:16:24 INFO mapred.JobClient: Task Id : attempt_201402091522_0059_m_000001_0, Status : FAILED
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 134.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
attempt_201402091522_0059_m_000001_0: #
attempt_201402091522_0059_m_000001_0: # A fatal error has been detected by the Java Runtime Environment:
attempt_201402091522_0059_m_000001_0: #
attempt_201402091522_0059_m_000001_0: # SIGSEGV (0xb) at pc=0x00007f6f1cd8827b, pid=21146, tid=140115055609600
attempt_201402091522_0059_m_000001_0: #
attempt_201402091522_0059_m_000001_0: # JRE version: 6.0_45-b06
attempt_201402091522_0059_m_000001_0: # Java VM: Java HotSpot(TM) 64-Bit Server VM (20.45-b01 mixed mode linux-amd64 compressed oops)
attempt_201402091522_0059_m_000001_0: # Problematic frame:
attempt_201402091522_0059_m_000001_0: # C [libSimString.so+0x6c27b][thread 140115045103360 also had an error]
attempt_201402091522_0059_m_000001_0: cdbpp::cdbpp_base<cdbpp::murmurhash2>::get(void const*, unsigned long, unsigned long*) const+0x16f
attempt_201402091522_0059_m_000001_0: #
attempt_201402091522_0059_m_000001_0: # An error report file with more information is saved as:
attempt_201402091522_0059_m_000001_0: # /app/hadoop/tmp/mapred/local/taskTracker/hduser/jobcache/job_201402091522_0059/attempt_201402091522_0059_m_000001_0/work/hs_err_pid21146.log
attempt_201402091522_0059_m_000001_0: [thread 140115070318336 also had an error]
attempt_201402091522_0059_m_000001_0: [thread 140114919028480 also had an error]
attempt_201402091522_0059_m_000001_0: [thread 140115089229568 also had an error]
attempt_201402091522_0059_m_000001_0: #
attempt_201402091522_0059_m_000001_0: # If you would like to submit a bug report, please visit:
attempt_201402091522_0059_m_000001_0: # http://java.sun.com/webapps/bugreport/crash.jsp
attempt_201402091522_0059_m_000001_0: # The crash happened outside the Java Virtual Machine in native code.
attempt_201402091522_0059_m_000001_0: # See problematic frame for where to report the bug.
The problem appears to originate in native code, in this frame:
cdbpp::cdbpp_base<cdbpp::murmurhash2>::get(void const*, unsigned long, unsigned long*) const+0x16f
However, I don't understand why this causes no problems on the small dataset. I am running the job with the following hadoop command:
hadoop jar hadoopjobs/job.jar Job -D mapred.child.java.opts=-Xss500k -D mapred.reduce.child.java.opts=-Xmx200m -files file1,file2,/home/hduser/libs/libSim/x64/libSimString.so -libjars /home/hduser/libs/Simstring.jar /datasources/XXX/spool/input datasources/XXX/spool/output
References:
SimString library: http://www.chokkan.org/software/simstring/
cdbpp source for `cdbpp_base::get(void const*, unsigned long, unsigned long*) const+0x16f`: https://gitorious.org/copy-paste/copy-paste/commit/5d9c6b5b29fb2b1b8dd571260e7d50d9c42db9f9
Answer 0 (score: 0)
The problem is probably not in the murmur hash itself, but in the native library and how it allocates memory.
I have no experience with JNI calls, but they are known to be problematic memory-wise (each such call allocates both stack and heap space), and there is no guarantee GC fires correctly (read the horror stories about GZipInputStream).
You say you create 22 × 100 threads, each of which may allocate some stack for its JNI calls, and the box only has 4 GB of RAM. The machine sounds heavily overcommitted; I would guess the constraint is CPU/memory access rather than long external waits (i.e. only a few of those threads are ever truly active in parallel)?
What happens when you drastically reduce the thread count? How is the SimString library meant to be used? Does it have an internal threading model that must be respected (e.g. only one thread may query at a time)?
I suspect the JNI side is single-threaded.
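The "drastically reduce the thread count" suggestion can be sketched with a fixed-size thread pool instead of 100 unbounded threads per mapper. This is a hypothetical sketch: `matchOne` stands in for whatever per-record SimString lookup the job performs (its real name and signature are not shown in the question), and the pool size of 4 is an arbitrary starting point to tune against the 4 GB box.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BoundedMatcher {

    // Run `tasks` lookups on a pool of `poolSize` threads and return
    // the number of completed lookups. Bounding the pool caps the total
    // per-thread stack space the mapper can consume.
    static int runAll(int tasks, int poolSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        List<Future<Integer>> results = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            final int id = i;
            results.add(pool.submit(() -> matchOne(id)));
        }
        int total = 0;
        for (Future<Integer> f : results) {
            total += f.get(); // propagate any worker exception
        }
        pool.shutdown();
        return total;
    }

    // Placeholder: a real implementation would call the native
    // SimString matcher here instead of returning a constant.
    static int matchOne(int id) {
        return 1;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runAll(100, 4)); // prints 100
    }
}
```

With 22 mappers × 100 threads and `-Xss500k`, thread stacks alone can claim on the order of a gigabyte; a small fixed pool keeps that bounded while still overlapping I/O.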
Answer 1 (score: 0)
As I mentioned above, the problem lies in calling the following method from Java:
cdbpp::cdbpp_base<cdbpp::murmurhash2>::get(void const*, unsigned long, unsigned long*) const+0x16f
I use 100 threads per mapper, with 22 mappers in total, 2 of which run in parallel. A static reader was invoking the method above without synchronization, which caused the problem. Wrapping this method call in a synchronized block solved the issue.
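The fix above can be sketched as follows. This is a simplified stand-in, not the SimString API: `retrieve` and the `StringBuilder` merely simulate a shared, non-thread-safe reader backed by the native library, and `DB_LOCK` is the hypothetical shared monitor that serializes all calls into it.

```java
public class SyncedReader {

    // Shared lock guarding the non-thread-safe lookup.
    private static final Object DB_LOCK = new Object();

    // Simulates the mutable state inside the shared static reader;
    // in the real job this would be the native SimString database handle.
    private static final StringBuilder shared = new StringBuilder();

    // Every thread funnels through the same lock, so the underlying
    // (simulated) native lookup never runs concurrently.
    static String retrieve(String query) {
        synchronized (DB_LOCK) {
            shared.setLength(0);
            shared.append("hit:").append(query);
            return shared.toString();
        }
    }

    public static void main(String[] args) throws Exception {
        final String[] out = new String[8];
        Thread[] ts = new Thread[8];
        for (int i = 0; i < 8; i++) {
            final int idx = i;
            ts[i] = new Thread(() -> out[idx] = retrieve("q" + idx));
            ts[i].start();
        }
        for (Thread t : ts) {
            t.join();
        }
        System.out.println(out[3]); // prints hit:q3
    }
}
```

The trade-off is that a single lock serializes all queries, so the 100 threads per mapper no longer buy any lookup parallelism; it trades throughput for correctness when the native side is not thread-safe.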