I want to use Apache Beam to process data held in a Spark JavaRDD object retrieved via sparkSession.sql("query"), but I am unable to apply a PTransform to this dataset directly. I am using Apache Beam 2.14.0 (the Spark runner was upgraded to use Spark 2.4.3 in BEAM-7265). Please guide me on how to do this.
SparkSession session = SparkSession.builder().appName("test 2.0").master("local[*]").getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(session.sparkContext());
final SparkContextOptions options = PipelineOptionsFactory.as(SparkContextOptions.class);
options.setRunner(SparkRunner.class);
options.setUsesProvidedSparkContext(true);
options.setProvidedSparkContext(jsc);
options.setEnableSparkMetricSinks(false);
Pipeline pipeline = Pipeline.create(options);
List<StructField> srcfields = new ArrayList<StructField>();
srcfields.add(DataTypes.createStructField("dataId", DataTypes.IntegerType, true));
srcfields.add(DataTypes.createStructField("code", DataTypes.StringType, true));
srcfields.add(DataTypes.createStructField("value", DataTypes.StringType, true));
srcfields.add(DataTypes.createStructField("dataFamilyId", DataTypes.IntegerType, true));
StructType dataschema = DataTypes.createStructType(srcfields);
List<Row> dataList = new ArrayList<Row>();
dataList.add(RowFactory.create(1, "AA", "Apple", 1));
dataList.add(RowFactory.create(2, "AB", "Orange", 1));
dataList.add(RowFactory.create(3, "AC", "Banana", 2));
dataList.add(RowFactory.create(4, "AD", "Guava", 3));
Dataset<Row> rawData = new SQLContext(jsc).createDataFrame(dataList, dataschema);//pipeline.getOptions().getRunner().cast();
JavaRDD<Row> javadata = rawData.toJavaRDD();
System.out.println("***************************************************");
for (Row line : javadata.collect()) {
    System.out.println(line.getInt(0) + "\t" + line.getString(1) + "\t" + line.getString(2) + "\t" + line.getInt(3));
}
System.out.println("***************************************************");
// Attempt to hand the Spark RDD to Beam: Create.of() treats the whole JavaRDD as a single element.
pipeline.apply(Create.of(javadata))
        .apply(ParDo.of(new DoFn<JavaRDD<Row>, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                JavaRDD<Row> row = c.element();
                c.output("------------------------------");
                System.out.println(".............................");
            }
        }))
        .apply("WriteCounts", TextIO.write().to("E:\\output\\out"));
final PipelineResult result = pipeline.run();
System.out.println();
System.out.println("***********************************end");
Answer 0 (score: 2)
I don't think this is possible, since Beam is supposed to know nothing about Spark RDDs, and the Beam Spark runner hides everything Spark-related under the hood. Potentially, you could create a custom Spark-specific PTransform that reads from an RDD and use it as an input to your pipeline for your specific case, but I'm not sure that's a good idea; perhaps it can be solved in a different way. Could you share more details about your data processing pipeline?
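As one illustration of "solving it in a different way" (a minimal sketch, not the answerer's code): if the dataset is small enough to fit in driver memory, you could collect the Spark rows on the driver, map each Row to a plain String, and feed that list to Beam with Create.of(). The variables rawData and pipeline are reused from the question's code; the names lines and "WriteRows" are illustrative.

import java.util.List;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Create;

// Collect the Spark rows on the driver and turn them into plain Strings
// (only feasible when the dataset fits in driver memory).
List<String> lines = rawData.toJavaRDD()
        .map(row -> row.getInt(0) + "\t" + row.getString(1) + "\t" + row.getString(2) + "\t" + row.getInt(3))
        .collect();

// Feed the collected rows into Beam as ordinary String elements.
pipeline
        .apply(Create.of(lines).withCoder(StringUtf8Coder.of()))
        .apply("WriteRows", TextIO.write().to("E:\\output\\out"));

pipeline.run().waitUntilFinish();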
Answer 1 (score: 0)
There is no way to consume Spark Datasets or RDDs directly into Beam, but you should be able to ingest data from a Hive table into a Beam PCollection. See the documentation for Beam's HCatalog IO connector: https://beam.apache.org/documentation/io/built-in/hcatalog/
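For reference, a minimal sketch of reading a Hive table through HCatalogIO might look like the following. It assumes the beam-sdks-java-io-hcatalog dependency is on the classpath and reuses pipeline from the question's code; the metastore URI, database, and table names are placeholders, not values from the question.

import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.io.hcatalog.HCatalogIO;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hive.hcatalog.data.HCatRecord;

// Hive metastore connection properties (placeholder host and port).
Map<String, String> configProperties = new HashMap<>();
configProperties.put("hive.metastore.uris", "thrift://metastore-host:9083");

// Read the table's records into a Beam PCollection of HCatRecord.
PCollection<HCatRecord> records = pipeline.apply(
        HCatalogIO.read()
                .withConfigProperties(configProperties)
                .withDatabase("default")   // Hive database name (placeholder)
                .withTable("my_table"));   // Hive table name (placeholder)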