I am trying to read complex nested JSON data from Kafka using Java, but I cannot form a Dataset from it.
The actual JSON messages sent to Kafka (note that the top-level key varies per record):
{"sample_title": {"txn_date": "2019-01-10","timestamp": "2019-02-01T08:57:18.100Z","txn_type": "TBD","txn_rcvd_time": "01/04/2019 03:32:32.135","txn_ref": "Test","txn_status": "TEST"}}
{"sample_title2": {"txn_date": "2019-01-10","timestamp": "2019-02-01T08:57:18.100Z","txn_type": "TBD","txn_rcvd_time": "01/04/2019 03:32:32.135","txn_ref": "Test","txn_status": "TEST"}}
{"sample_title3": {"txn_date": "2019-01-10","timestamp": "2019-02-01T08:57:18.100Z","txn_type": "TBD","txn_rcvd_time": "01/04/2019 03:32:32.135","txn_ref": "Test","txn_status": "TEST"}}
Dataset<Row> df = spark.readStream().format("kafka")
        .option("spark.local.dir", config.getString(PropertyKeys.SPARK_APPLICATION_TEMP_LOCATION.getCode()))
        .option("kafka.bootstrap.servers", config.getString(PropertyKeys.KAFKA_BOORTSTRAP_SERVERS.getCode()))
        .option("subscribe", config.getString(PropertyKeys.KAFKA_TOPIC_IPE_STP.getCode()))
        .option("startingOffsets", "earliest")
        .option("spark.default.parallelism", config.getInt(PropertyKeys.SPARK_APPLICATION_DEFAULT_PARALLELISM_VALUE.getCode()))
        .option("spark.sql.shuffle.partitions", config.getInt(PropertyKeys.SPARK_APPLICATION_SHUFFLE_PARTITIONS_COUNT.getCode()))
        .option("kafka.security.protocol", config.getString(PropertyKeys.SECURITY_PROTOCOL.getCode()))
        .option("kafka.ssl.truststore.location", config.getString(PropertyKeys.SSL_TRUSTSTORE_LOCATION.getCode()))
        .option("kafka.ssl.truststore.password", config.getString(PropertyKeys.SSL_TRUSTSTORE_PASSWORD.getCode()))
        .option("kafka.ssl.keystore.location", config.getString(PropertyKeys.SSL_KEYSTORE_LOCATION.getCode()))
        .option("kafka.ssl.keystore.password", config.getString(PropertyKeys.SSL_KEYSTORE_PASSWORD.getCode()))
        .option("kafka.ssl.key.password", config.getString(PropertyKeys.SSL_KEY_PASSWORD.getCode()))
        .load()
        .selectExpr("CAST(key AS STRING)",
                "CAST(value AS STRING)",
                "topic as topic",
                "partition as partition",
                "offset as offset",
                "timestamp as timestamp",
                "timestampType as timestampType");
Dataset<String> output = df.selectExpr("CAST(value AS STRING)")
        .as(Encoders.STRING())
        .filter((FilterFunction<String>) x -> x.contains("sample_title"));
Since the input can contain multiple schemas (a different top-level key per record), the code should handle that, filter on the title, and map each record into a Dataset of type Title.
public class Title implements Serializable {
String txn_date;
Timestamp timestamp;
String txn_type;
String txn_rcvd_time;
String txn_ref;
String txn_status;
}
Answer 0 (score: 0)
First, make the Title class a proper Java bean, i.e., write getters and setters for its fields.
public class Title implements Serializable {
    private String txn_date;
    private Timestamp timestamp;
    private String txn_type;
    private String txn_rcvd_time;
    private String txn_ref;
    private String txn_status;

    public Title() {} // bean encoders need a public no-arg constructor

    public Title(String data) { ... } // set the fields from the raw JSON string

    // add getters and setters for all fields
}
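A minimal sketch of what that constructor could do, assuming Jackson is on the classpath (it ships with Spark) and that each record nests the transaction fields under a single, variable top-level key as in the samples above:

import java.sql.Timestamp;
import java.time.Instant;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public Title(String data) {
    try {
        JsonNode root = new ObjectMapper().readTree(data);
        // The transaction object sits under a variable top-level key
        // ("sample_title", "sample_title2", ...), so take the first value.
        JsonNode txn = root.elements().next();
        this.txn_date = txn.get("txn_date").asText();
        this.timestamp = Timestamp.from(Instant.parse(txn.get("timestamp").asText()));
        this.txn_type = txn.get("txn_type").asText();
        this.txn_rcvd_time = txn.get("txn_rcvd_time").asText();
        this.txn_ref = txn.get("txn_ref").asText();
        this.txn_status = txn.get("txn_status").asText();
    } catch (Exception e) {
        throw new IllegalArgumentException("Cannot parse Title from: " + data, e);
    }
}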
// imports needed: org.apache.spark.api.java.function.MapFunction, FilterFunction
Dataset<Title> resultdf = df.selectExpr("CAST(value AS STRING)")
        .as(Encoders.STRING())
        .map((MapFunction<String, Title>) value -> new Title(value), Encoders.bean(Title.class));
// apply any predicate on Title, e.g. keep only one transaction status:
Dataset<Title> filtered = resultdf.filter((FilterFunction<Title>) title -> "TEST".equals(title.getTxn_status()));
If you want to filter the data first and only then apply the encoding:
// static imports needed: org.apache.spark.sql.functions.col, get_json_object
Dataset<Title> titles = df.selectExpr("CAST(value AS STRING)")
        .filter(get_json_object(col("value"), "$.sample_title").isNotNull())
        // for a simple string filter instead:
        // .filter((FilterFunction<Row>) t -> t.getString(0).contains("sample_title"))
        .as(Encoders.STRING())
        .map((MapFunction<String, Title>) value -> new Title(value), Encoders.bean(Title.class));
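As an alternative to parsing inside a constructor (not part of the original answer, just a sketch): Spark can parse the JSON itself with from_json, given an explicit schema for one title key. This keeps parsing inside the SQL engine and only needs the bean encoder at the end:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

StructType txnSchema = new StructType()
        .add("txn_date", DataTypes.StringType)
        .add("timestamp", DataTypes.TimestampType)
        .add("txn_type", DataTypes.StringType)
        .add("txn_rcvd_time", DataTypes.StringType)
        .add("txn_ref", DataTypes.StringType)
        .add("txn_status", DataTypes.StringType);
// one top-level key; repeat .add(...) for the other titles you expect
StructType schema = new StructType().add("sample_title", txnSchema);

Dataset<Title> titles = df.selectExpr("CAST(value AS STRING)")
        .select(from_json(col("value"), schema).alias("parsed"))
        .filter(col("parsed.sample_title").isNotNull())
        .selectExpr("parsed.sample_title.*")
        .as(Encoders.bean(Title.class));

With from_json, records that do not contain sample_title simply produce a null struct, which the isNotNull filter drops.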