I am using Spark 2.1.0.
I am trying to load a BigTable CSV file with more than 1,500 columns into our system.
Our process:
I searched online for solutions, or even similar use cases, and found only a few posts that mention the 64 KB error. All of those cases, however, involve around 100 columns and were solved in Spark 2.1.0 by shrinking the generated code; none of them actually reached the JVM limit.
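For reference, the mitigation most commonly suggested alongside those reports is disabling whole-stage code generation. A minimal sketch of that toggle (spark.sql.codegen.wholeStage is a standard Spark SQL setting, but our failing class, SpecificUnsafeProjection, comes from expression codegen, so it is not clear this even applies to our case):

import org.apache.spark.sql.SparkSession;

public class CodegenToggle {
    public static void main(String[] args) {
        // Build a local session with whole-stage code generation turned off.
        // This keeps operators from being fused into one huge generated class,
        // but it does not change how the projection code for a 1,900-column
        // row is generated, which is why we are unsure it helps here.
        SparkSession sparkSession = SparkSession.builder()
                .appName("codegen-toggle")
                .master("local[*]")
                .config("spark.sql.codegen.wholeStage", "false")
                .getOrCreate();

        System.out.println(sparkSession.conf().get("spark.sql.codegen.wholeStage"));
        sparkSession.stop();
    }
}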
Any ideas from the experts on this forum would be much appreciated.
We are looking for two kinds of solutions:
Our temporary solution:
Code that reproduces the problem:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import poc.commons.SparkSessionInitializer;
import java.util.Collections;
import java.util.List;
import java.util.stream.IntStream;
public class RDDConverter {
    private static final int FIELD_COUNT = 1900;

    private Dataset<Row> createBigSchema(SparkSession sparkSession, int startColName, int fieldNumber) {
        JavaSparkContext jsc = new JavaSparkContext(sparkSession.sparkContext());
        SQLContext sqlContext = new SQLContext(sparkSession.sparkContext());

        // One row whose values are simply the stringified column indices.
        String[] row = IntStream.range(startColName, fieldNumber).mapToObj(String::valueOf).toArray(String[]::new);
        List<String[]> data = Collections.singletonList(row);
        JavaRDD<Row> rdd = jsc.parallelize(data).map(RowFactory::create);

        // A matching schema of nullable string columns, one per index.
        StructField[] structFields = IntStream.range(startColName, fieldNumber)
                .mapToObj(i -> new StructField(String.valueOf(i), DataTypes.StringType, true, Metadata.empty()))
                .toArray(StructField[]::new);
        StructType schema = DataTypes.createStructType(structFields);

        Dataset<Row> dataSet = sqlContext.createDataFrame(rdd, schema);
        dataSet.show(); // forces evaluation, which is where codegen blows up
        return dataSet;
    }

    public static void main(String[] args) {
        SparkSessionInitializer sparkSessionInitializer = new SparkSessionInitializer();
        SparkSession sparkSession = sparkSessionInitializer.init();

        RDDConverter rddConverter = new RDDConverter();
        rddConverter.createBigSchema(sparkSession, 0, FIELD_COUNT);
    }
}
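The toy example above builds the wide row with parallelize; in our real flow the data comes from the 1,500+ column CSV mentioned at the top. A minimal sketch of that load path (the file path is a placeholder, and the session setup assumes a plain local SparkSession rather than our SparkSessionInitializer), which we expect to hit the same limit at this column count:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WideCsvLoad {
    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession.builder()
                .appName("wide-csv-load")
                .master("local[*]")
                .getOrCreate();

        // Placeholder path standing in for the real BigTable CSV export.
        Dataset<Row> wide = sparkSession.read()
                .option("header", "true")
                .csv("/tmp/bigtable_export.csv");

        wide.show(1); // goes through the same UnsafeProjection codegen as the repro above
        sparkSession.stop();
    }
}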
The exception we get:
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:893)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:950)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:947)
at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
... 39 common frames omitted
Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass;[Ljava/lang/Object;)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)