How to develop a custom Spark data source in Java?

Time: 2017-03-22 18:52:03

Tags: java apache-spark apache-spark-sql

I am trying to build a custom file data source for Spark, in Java. I have found numerous examples in Scala (including the CSV and XML data sources from Databricks), but I cannot bring Scala into this project. We also already have the parser itself written in Java; I just need to build the "glue" between the parser and Spark.

This is how I'd like to call it:

    String filename = "src/test/resources/simple.x";

    SparkSession spark = SparkSession.builder().appName("X-parse").master("local").getOrCreate();

    Dataset<Row> df = spark.read().format("x.RandomDataSource")
            .option("metadataTag", "schema") // hint to find schema
            .option("dataTag", "data") // hint to find data
            .load(filename); // local file

So far, I have tried to implement x.RandomDataSource in two ways:

  1. Based on FileFormat, which makes the most sense, but I do not have a clue how to implement buildReader()...
  2. Based on RelationProvider, but same here...

It seems that in both cases, the call reaches the right class, but I run into an NPE because I provide so little. Any hint or example would be greatly appreciated!

Update #1

I simplified my project, my call is:

    String filename = "src/test/resources/simple.json";
    SparkSession spark = SparkSession.builder().appName("X-parse").master("local").getOrCreate();
    Dataset<Row> df = spark.read().format("x.CharCounterDataSource")
        .option("char", "a") // count the number of 'a'
        .load(filename); // local file (line 40 in the stacks below)
    df.show();

Ideally, this should display something like:

    +--+
    | a|
    +--+
    |45|
    +--+

Things get trickier when I try to work on x.CharCounterDataSource:

I looked at two ways to do it:

1) One based on FileFormat:

    public class CharCounterDataSource implements FileFormat {

        @Override
        public Function1<PartitionedFile, Iterator<InternalRow>> buildReader(SparkSession arg0, StructType arg1,
                StructType arg2, StructType arg3, Seq<Filter> arg4, Map<String, String> arg5, Configuration arg6) {
            // TODO Auto-generated method stub
            return null;
        }

        @Override
        public Function1<PartitionedFile, Iterator<InternalRow>> buildReaderWithPartitionValues(SparkSession arg0,
                StructType arg1, StructType arg2, StructType arg3, Seq<Filter> arg4, Map<String, String> arg5,
                Configuration arg6) {
            // TODO Auto-generated method stub
            return null;
        }

        @Override
        public Option<StructType> inferSchema(SparkSession arg0, Map<String, String> arg1, Seq<FileStatus> arg2) {
            // TODO Auto-generated method stub
            return null;
        }

        @Override
        public boolean isSplitable(SparkSession arg0, Map<String, String> arg1, Path arg2) {
            // TODO Auto-generated method stub
            return false;
        }

        @Override
        public OutputWriterFactory prepareWrite(SparkSession arg0, Job arg1, Map<String, String> arg2, StructType arg3) {
            // TODO Auto-generated method stub
            return null;
        }

        @Override
        public boolean supportBatch(SparkSession arg0, StructType arg1) {
            // TODO Auto-generated method stub
            return false;
        }
    }

I know it is an empty class (generated by Eclipse) and I am not expecting much out of it.

Running it says:

    java.lang.NullPointerException
        at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:188)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)
        at x.spark.datasource.counter.CharCounterDataSourceTest.test(CharCounterDataSourceTest.java:40)

Nothing surprising...
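(From the stack, the NPE apparently comes from inferSchema() returning null where Spark expects an actual scala.Option. A minimal sketch of a non-null version, with the single int column "a" hard-coded purely as a placeholder:)

    @Override
    public Option<StructType> inferSchema(SparkSession sparkSession, Map<String, String> options,
            Seq<FileStatus> files) {
        // Placeholder: one int column named "a", matching the expected
        // output above. A real implementation would look at the files
        // (and/or the options) to work out the schema.
        return Option.apply(new StructType().add("a", DataTypes.IntegerType));
    }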

2) One based on RelationProvider:

    public class CharCounterDataSource implements RelationProvider {

        @Override
        public BaseRelation createRelation(SQLContext arg0, Map<String, String> arg1) {
            // TODO Auto-generated method stub
            return null;
        }

    }

which fails too...

    java.lang.NullPointerException
        at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:40)
        at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:389)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)
        at x.CharCounterDataSourceTest.test(CharCounterDataSourceTest.java:40)

Don't get me wrong - I understand it fails - but what I need is "just one hint" to continue building the glue ;-)...
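(For the record, the shape Spark seems to expect from this route is a non-null BaseRelation that also implements TableScan. Here is a rough Java sketch of that wiring; CharCounterRelation and the hard-coded row are my own placeholders, not working parser code:)

    import java.io.Serializable;
    import java.util.Collections;

    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.rdd.RDD;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SQLContext;
    import org.apache.spark.sql.sources.BaseRelation;
    import org.apache.spark.sql.sources.RelationProvider;
    import org.apache.spark.sql.sources.TableScan;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    import scala.collection.immutable.Map;

    public class CharCounterDataSource implements RelationProvider {

        @Override
        public BaseRelation createRelation(SQLContext sqlContext, Map<String, String> parameters) {
            // Returning null here is what LogicalRelation's constructor
            // trips over; it needs a real BaseRelation back.
            return new CharCounterRelation(sqlContext, parameters);
        }
    }

    class CharCounterRelation extends BaseRelation implements TableScan, Serializable {

        private final SQLContext sqlContext;

        CharCounterRelation(SQLContext sqlContext, Map<String, String> parameters) {
            this.sqlContext = sqlContext;
            // The "char" option and the "path" set by load() arrive in
            // parameters (as scala.Option values) and would be unpacked here.
        }

        @Override
        public SQLContext sqlContext() {
            return sqlContext;
        }

        @Override
        public StructType schema() {
            // One int column named "a", matching the expected output above.
            return new StructType().add("a", DataTypes.IntegerType);
        }

        @Override
        public RDD<Row> buildScan() {
            // Placeholder: a single hard-coded row. The real version would
            // read the file, run the Java parser, and count characters.
            JavaSparkContext jsc = new JavaSparkContext(sqlContext.sparkContext());
            return jsc.parallelize(Collections.singletonList(RowFactory.create(45))).rdd();
        }
    }

Here buildScan() plays the role buildReader() has in the FileFormat route: it is where the rows actually get produced.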

(Un)fortunately, we cannot use Scala...

0 Answers:

No answers