I need to load HDFS files in parallel and process each file in parallel (read it and filter it by some condition). The code below loads the files serially. The Spark application runs with three workers, each with 4 cores. I even tried setting the partition-count (numSlices) parameter of the parallelize method, but saw no performance improvement. I am sure my cluster has enough resources to run the jobs in parallel. What should I change to make this run in parallel?
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
sparkConf.set("spark.closure.serializer", "org.apache.spark.serializer.JavaSerializer");
JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
JavaRDD<String> files = sparkContext.parallelize(fileList);
Iterator<String> localIterator = files.toLocalIterator();
while (localIterator.hasNext())
{
    String hdfsPath = localIterator.next();
    long startTime = DateUtil.getCurrentTimeMillis();
    JavaPairRDD<IntWritable, BytesWritable> hdfsContent = sparkContext.sequenceFile(hdfsPath, IntWritable.class, BytesWritable.class);
    try
    {
        JavaRDD<Message> logs = hdfsContent.map(new Function<Tuple2<IntWritable, BytesWritable>, Message>()
        {
            public Message call(Tuple2<IntWritable, BytesWritable> tuple2) throws Exception
            {
                BytesWritable value = tuple2._2();
                BytesWritable tmp = new BytesWritable();
                tmp.setCapacity(value.getLength());
                tmp.set(value);
                return (Message) getProtos(logtype, tmp.getBytes());
            }
        });
        final JavaRDD<Message> filteredLogs = logs.filter(new Function<Message, Boolean>()
        {
            public Boolean call(Message msg) throws Exception
            {
                FieldDescriptor fd = msg.getDescriptorForType().findFieldByName("method");
                String value = (String) msg.getField(fd);
                if (value.equals("POST"))
                {
                    return true;
                }
                return false;
            }
        });
        long timetaken = DateUtil.getCurrentTimeMillis() - startTime;
        LOGGER.log(Level.INFO, "HDFS: {0} Total Log Count : {1} Filtered Log Count : {2} TimeTaken : {3}", new Object[] { hdfsPath, logs.count(), filteredLogs.count(), timetaken });
    }
    catch (Exception e)
    {
        LOGGER.log(Level.INFO, "Exception : ", e);
    }
}
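For what it's worth, the serial shape of my loop is independent of Spark itself: toLocalIterator pulls one path at a time back to the driver, and the next sequenceFile job is only submitted after the previous iteration's count() actions finish. A plain-Java sketch of the difference (no Spark here; process is a hypothetical stand-in for "load one file and filter it"):

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class SerialVsParallel {
    // Hypothetical stand-in for "load one HDFS file and filter it".
    static String process(String path) {
        return "filtered:" + path;
    }

    // Mirrors the while (localIterator.hasNext()) loop: each file is
    // processed only after the previous one has completely finished.
    static Queue<String> runSerial(List<String> fileList) {
        Queue<String> results = new ConcurrentLinkedQueue<>();
        for (String path : fileList) {
            results.add(process(path));
        }
        return results;
    }

    // The same per-file work submitted from parallel driver threads,
    // so independent files can overlap in time.
    static Queue<String> runParallel(List<String> fileList) {
        Queue<String> results = new ConcurrentLinkedQueue<>();
        fileList.parallelStream().forEach(path -> results.add(process(path)));
        return results;
    }

    public static void main(String[] args) {
        List<String> files = java.util.Arrays.asList("a.seq", "b.seq", "c.seq");
        System.out.println(runSerial(files));          // deterministic order
        System.out.println(runParallel(files).size()); // same work, unordered
    }
}
```

In Spark terms the analogue would presumably be either submitting the per-file jobs from multiple driver threads, or loading all the files into a single RDD in one call (Hadoop input formats accept comma-separated paths and globs) so the cluster parallelizes across files instead of within one file at a time.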
Instead of iterating over the files RDD, I also tried Spark functions such as map and foreach, but they throw a SparkException. The closure does not reference any outside variable, and my class (OldLogAnalyzer) already implements the Serializable interface. KryoSerializer and JavaSerializer are also configured in SparkConf. I am confused about what in my code is not serializable.
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1622)
at org.apache.spark.rdd.RDD.map(RDD.scala:286)
at org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:81)
at org.apache.spark.api.java.JavaRDD.map(JavaRDD.scala:32)
at com.test.logs.spark.OldLogAnalyzer.main(OldLogAnalyzer.java:423)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: org.apache.spark.api.java.JavaSparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.api.java.JavaSparkContext, value: org.apache.spark.api.java.JavaSparkContext@68f277a2)
- field (class: com.test.logs.spark.OldLogAnalyzer$10, name: val$sparkContext, type: class org.apache.spark.api.java.JavaSparkContext)
- object (class com.test.logs.spark.OldLogAnalyzer$10, com.test.logs.spark.OldLogAnalyzer$10@2f80b005)
- field (class: org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: fun$1, type: interface org.apache.spark.api.java.function.Function)
- object (class org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:38)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
... 15 more
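The serialization stack in the trace points at the cause: the anonymous Function (OldLogAnalyzer$10) captured the sparkContext variable (the synthetic val$sparkContext field), and JavaSparkContext is not serializable, no matter whether the enclosing class implements Serializable. A minimal stdlib sketch of the same mechanism (Context is a hypothetical stand-in for JavaSparkContext):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class CaptureDemo {
    // Non-serializable stand-in for JavaSparkContext.
    static class Context { }

    // The anonymous class references ctx, so javac generates a synthetic
    // val$ctx field -- the analogue of val$sparkContext in the trace.
    static Serializable capturing(final Context ctx) {
        return new Serializable() {
            Object context() { return ctx; }
        };
    }

    // References nothing from its surroundings, so it carries no hidden fields.
    static class Standalone implements Serializable { }

    // True if Java serialization succeeds for o; false when it fails
    // (NotSerializableException is an IOException).
    static boolean serializes(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(serializes(capturing(new Context()))); // false
        System.out.println(serializes(new Standalone()));         // true
    }
}
```

So a closure fails serialization as soon as any captured value (or the implicit outer instance of a non-static anonymous class) is non-serializable, which is why implementing Serializable on OldLogAnalyzer alone does not help.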