I am trying to run an ETL job that involves loading files from HDFS, applying transformations, and writing the results to Hive. While applying the transformations with SqlTransform, following this documentation, I ran into the issue below. Could you help?
java.lang.IllegalStateException: Cannot call getSchema when there is no schema
at org.apache.beam.sdk.values.PCollection.getSchema(PCollection.java:328)
at org.apache.beam.sdk.extensions.sql.impl.schema.BeamPCollectionTable.<init>(BeamPCollectionTable.java:34)
at org.apache.beam.sdk.extensions.sql.SqlTransform.toTableMap(SqlTransform.java:105)
at org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:90)
at org.apache.beam.sdk.extensions.sql.SqlTransform.expand(SqlTransform.java:77)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:471)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:339)
at org.apache.beam.examples.SqlTest.runSqlTest(SqlTest.java:107)
at org.apache.beam.examples.SqlTest.main(SqlTest.java:167)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:282)
at java.lang.Thread.run(Thread.java:748)
Code snippet:
PCollection<String> data = p.apply("ReadLines", TextIO.read().from(options.getInputFile()));

if (options.getOutput().equals("hive")) {
    Schema hiveTableSchema = Schema.builder()
        .addStringField("eid")
        .addStringField("name")
        .addStringField("salary")
        .addStringField("destination")
        .build();

    data.apply(ParDo.of(new DoFn<String, Row>() {
            @ProcessElement
            public void processElement(@Element String input, OutputReceiver<Row> out) {
                String[] values = input.split(",");
                System.out.println(values);
                Row row = Row.withSchema(hiveTableSchema)
                    .addValues(values)
                    .build();
                out.output(row);
            }
        }))
        .apply(SqlTransform.query("select eid, destination from PCOLLECTION"))
        .apply(ParDo.of(new DoFn<Row, HCatRecord>() {
            @ProcessElement
            public void processElement(@Element Row input, OutputReceiver<HCatRecord> out) {
                HCatRecord record = new DefaultHCatRecord(input.getFieldCount());
                for (int i = 0; i < input.getFieldCount(); i++) {
                    record.set(i, input.getString(i));
                }
                out.output(record);
            }
        }))
        .apply("WriteData", HCatalogIO.write()
            .withConfigProperties(configProperties)
            .withDatabase("wmrpoc")
            .withTable(options.getOutputTableName()));
}
Answer:
It looks like you need to set the schema on the PCollection. In the walkthrough you linked, Create...withCoder() takes care of that. In your case the schema cannot be inferred from your ParDo: the only information Beam can look at is that it outputs elements of type Row, but there is no way to know whether your ParDo even follows a single schema for all of its outputs. So, before applying SqlTransform, call pcollection.setRowSchema() to tell Beam which schema you intend to produce from that conversion ParDo.
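For example, applied to your snippet, the fix would look roughly like this (a minimal sketch reusing your hiveTableSchema; the intermediate variable rows is introduced here only for illustration):

PCollection<Row> rows = data
    .apply(ParDo.of(new DoFn<String, Row>() {
        @ProcessElement
        public void processElement(@Element String input, OutputReceiver<Row> out) {
            String[] values = input.split(",");
            out.output(Row.withSchema(hiveTableSchema).addValues(values).build());
        }
    }))
    // Attach the schema explicitly so SqlTransform can call getSchema() on this PCollection.
    .setRowSchema(hiveTableSchema);

rows.apply(SqlTransform.query("select eid, destination from PCOLLECTION"))
    // ...then continue with the Row -> HCatRecord conversion and HCatalogIO.write() as before.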
Update:
And it looks like much of what you are doing before HCatalog could eventually be simplified quite a bit, e.g. imagine only having to specify something like pipeline.apply(TextIO.readCsvRows(schema)).apply(sqlTransform).... In fact, Beam SQL already supports reading CSV files without extra ParDos (through TextTableProvider), but that is not yet hooked up to SqlTransform and is only accessible through the Beam SQL CLI.
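For reference, in the Beam SQL CLI that would look roughly like the DDL sketch below. This is only an illustration: the table name, column types, and file path are assumptions, and the exact syntax depends on your Beam version.

-- Hypothetical example for the Beam SQL CLI; adjust the name, types, and path to your data.
CREATE EXTERNAL TABLE employees (
    eid VARCHAR,
    name VARCHAR,
    salary VARCHAR,
    destination VARCHAR
)
TYPE text
LOCATION '/path/to/input.csv';

SELECT eid, destination FROM employees;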