我有一个从Kafka读取的流式作业(@ 1min批处理),并在一些操作后将其POST到HTTP端点。每隔几个小时它就会陷入“处理”阶段并开始排队工作:
在检查了正在运行的'Executors'(在app-UI页面中)后,我发现了这一点 6个执行者中有1个显示2个“活动任务”。
为此进行线程转储后,它显示了“Executor task launch worker”线程池(source)的2个线程。这些线程都陷入了同样的错误:
完全可读的错误:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:502)
java.net.InetAddress.checkLookupTable(InetAddress.java:1393)
java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1310)
java.net.InetAddress.getAllByName0(InetAddress.java:1276)
java.net.InetAddress.getAllByName(InetAddress.java:1192)
java.net.InetAddress.getAllByName(InetAddress.java:1126)
java.net.InetAddress.getByName(InetAddress.java:1076)
java.net.InetSocketAddress.<init>(InetSocketAddress.java:220)
kafka.network.BlockingChannel.liftedTree1$1(BlockingChannel.scala:59)
kafka.network.BlockingChannel.connect(BlockingChannel.scala:49)
kafka.consumer.SimpleConsumer.connect(SimpleConsumer.scala:44)
kafka.consumer.SimpleConsumer.getOrMakeConnection(SimpleConsumer.scala:151)
kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:69)
kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:68)
kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:112)
kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:112)
kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:112)
kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:111)
kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:111)
kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:111)
kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:110)
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:193)
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:209)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
org.apache.spark.scheduler.Task.run(Task.scala:85)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
这似乎是JDK bug,必须在JDK 7中修复 - 我确保我使用的是'1.8.0_101(Oracle Corporation)'。我尝试在命令行上添加以下内容(如建议的here),但它没有解决问题:
-Djava.net.preferIPv4Stack=true -Dnetworkaddress.cache.ttl=60
有没有人对调试/修复此方法有任何想法?
*编辑:重命名问题以消除令人困惑的JDK原因
答案 0 :(得分:2)
原来是一个内核级别的错误https://bugzilla.redhat.com/show_bug.cgi?id=1209433,它在Linux内核版本4.0.6中得到解决,而我的工作者正在运行的主机上有RHEL,内核版本为3.5.6。希望在使用内核版本4.5的新CentOS机器上部署后,它不会成为一个问题。
我是如何弄明白的,每次它都被卡在“查看表”中。或者&#39; lookupAllHostAddr&#39;,两者都是对底层操作系统的本机(JNI)调用。