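While moving data from S3 to Mongo through the spark-mongo connector and transforming it with SparkSQL, I need to convert a column from string to UUID. The column is stored as a string in S3, and I am looking for the appropriate transformation function so that it is stored as a UUID when saved to Mongo.
I tried using a UDF, but I could not read the specific column from the DataFrame and convert its string value to a UUID. Any advice on how to write a Spark UDF for this?
Sample input from the S3 file: key1 string, key2 string, key2_type int
Expected output in Mongo: key1 UUID, key2 string, key2_type int
Currently we use a SparkSQL transformation to save from S3 to Mongo: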
sourceMap = sourceMap ++ jsonObjectPropertiesToMap(List("s3path", "fileformat", "awsaccesskeyid", "awssecretaccesskey"), source)
sparkSession.sparkContext.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive" , "true")
setAWSCredentials (sparkSession, sourceMap);
df = s3ToDataFrame(sourceMap("s3path"), sourceMap("fileformat"), sparkSession)
val dft = sparkSession.sql(mappingsToTransformedSQL(mappings))
destinationMap = destinationMap ++ jsonObjectPropertiesToMap(List("cluster", "database", "authenticationdatabase","collection", "login", "password"), destination)
dataFrameToMongodb(destinationMap("cluster"), destinationMap("database"), destinationMap("authenticationdatabase"),destinationMap("collection"),destinationMap("login"),destinationMap("password"), dft)

Below is the suggested stringToUUID function:
def stringToUUID(uuid : String):String = {
java.util.UUID.fromString(
uuid
.replaceFirst(
"(\\p{XDigit}{8})(\\p{XDigit}{4})(\\p{XDigit}{4})(\\p{XDigit}{4})(\\p{XDigit}+)", "$1-$2-$3-$4-$5"
)
).toString
}
val stringToUUIDUdf = udf((uuid: String) => stringToUUID(uuid))
dft.withColumn("key1", stringToUUIDUdf(df("key1")))

This is the error we get:
17/07/01 17:51:05 INFO SparkSqlParser: Parsing command: Select key1 AS key1,key1_type_id AS key1_type_id,key2 AS key2,key2_type_id AS key2_type_id,site AS site,updated AS updated FROM tmp
org.apache.spark.sql.AnalysisException: resolved attribute(s) key1#1 missing from key2#19,updated#22,site#21,key1#17,key1_type_id#18,key2_type_id#20 in operator !Project [UDF(key1#1) AS key1#30, key1_type_id#18, key2#19, key2_type_id#20, site#21, updated#22];;
!Project [UDF(key1#1) AS key1#30, key1_type_id#18, key2#19, key2_type_id#20, site#21, updated#22]
+- Project [key1#1 AS key1#17, key1_type_id#2 AS key1_type_id#18, key2#3 AS key2#19, key2_type_id#4 AS key2_type_id#20, site#5 AS site#21, updated#6 AS updated#22]
+- SubqueryAlias tmp, `tmp`
+- Relation[key1#1,key1_type_id#2,key2#3,key2_type_id#4,site#5,updated#6,pdateid#7] parquet

Answer 0 (score: 1)
Start by defining a Scala function:
def stringToUUID(uuid: String): String = {
java.util.UUID.fromString(
uuid
.replaceFirst(
"(\\p{XDigit}{8})(\\p{XDigit}{4})(\\p{XDigit}{4})(\\p{XDigit}{4})(\\p{XDigit}+)", "$1-$2-$3-$4-$5"
)
).toString
}
Create a UDF based on the function above:
import org.apache.spark.sql.functions.udf
val stringToUUIDUdf = udf((uuid: String) => stringToUUID(uuid))
Use the withColumn transformation to add a new uuid column:
df.withColumn("uuid", stringToUUIDUdf(df("text")))
You can also use the select transformation:
df.select(stringToUUIDUdf(df("text")).alias("uuid"))
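A note on the error shown in the question: the message "resolved attribute(s) key1#1 missing" indicates that the UDF was applied to df("key1"), a column of the source DataFrame, while withColumn was called on dft, whose key1 is a different attribute (key1#17) after the SQL projection. A minimal sketch of the fix, assuming a local SparkSession and a one-row stand-in for the S3 data (both are illustrative and not from the original post):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Local session for illustration only (assumption: Spark SQL is on the classpath).
val spark = SparkSession.builder().master("local[1]").appName("uuid-column-sketch").getOrCreate()
import spark.implicits._

// Stand-in for the S3 source: one compact UUID string in a column named key1.
val df = Seq("7158e7a4c1284697bcab58dfb8c80e66").toDF("key1")
df.createOrReplaceTempView("tmp")

// The SQL projection gives key1 a new attribute id, as in the plan from the question.
val dft = spark.sql("SELECT key1 AS key1 FROM tmp")

// The conversion logic is inlined so the closure captures nothing from the enclosing scope.
val stringToUUIDUdf = udf((raw: String) =>
  java.util.UUID.fromString(
    raw.replaceFirst(
      "(\\p{XDigit}{8})(\\p{XDigit}{4})(\\p{XDigit}{4})(\\p{XDigit}{4})(\\p{XDigit}+)",
      "$1-$2-$3-$4-$5")
  ).toString)

// dft.withColumn("key1", stringToUUIDUdf(df("key1"))) would point at the source DataFrame's attribute.
// Resolving the column by name (col), or through dft itself, keeps the reference inside dft's plan.
val fixed = dft.withColumn("key1", stringToUUIDUdf(col("key1")))
val converted = fixed.collect().map(_.getString(0)).toSeq

spark.stop()
```

Using dft("key1") instead of col("key1") should work equally well here; the point is only that the column must come from the DataFrame that withColumn is called on.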
Example:
val df = session.createDataset(Seq(
"7158e7a4c1284697bcab58dfb8c80e66",
"cf251f4c667c46b3a9f67681f3be2338",
"42d3ee515d8c4268b47b579170c88e4c",
"6b7e3222292d4dc5a8a369f7fede7dc4",
"b371896d39d04fbb8a8646a176e60d17",
"e2b57f1677154c5bbe181a575aba4684",
"2a2e11c4cc604673bbd13b22f029dabb",
"fcad3f649a114336a721fc3eaefd6ce1",
"f3f6fcfd16394e1e9c98aae0bd062432",
"8b0e1929e335489997bfca20bb021d62"
)).toDF("text")
df.withColumn("uuid", stringToUUIDUdf(df("text"))).show(false)
Result:
+--------------------------------+------------------------------------+
|text                            |uuid                                |
+--------------------------------+------------------------------------+
|7158e7a4c1284697bcab58dfb8c80e66|7158e7a4-c128-4697-bcab-58dfb8c80e66|
|cf251f4c667c46b3a9f67681f3be2338|cf251f4c-667c-46b3-a9f6-7681f3be2338|
|42d3ee515d8c4268b47b579170c88e4c|42d3ee51-5d8c-4268-b47b-579170c88e4c|
|6b7e3222292d4dc5a8a369f7fede7dc4|6b7e3222-292d-4dc5-a8a3-69f7fede7dc4|
|b371896d39d04fbb8a8646a176e60d17|b371896d-39d0-4fbb-8a86-46a176e60d17|
|e2b57f1677154c5bbe181a575aba4684|e2b57f16-7715-4c5b-be18-1a575aba4684|
|2a2e11c4cc604673bbd13b22f029dabb|2a2e11c4-cc60-4673-bbd1-3b22f029dabb|
|fcad3f649a114336a721fc3eaefd6ce1|fcad3f64-9a11-4336-a721-fc3eaefd6ce1|
|f3f6fcfd16394e1e9c98aae0bd062432|f3f6fcfd-1639-4e1e-9c98-aae0bd062432|
|8b0e1929e335489997bfca20bb021d62|8b0e1929-e335-4899-97bf-ca20bb021d62|
+--------------------------------+------------------------------------+
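The stringToUUID function in this answer throws on a null value (NullPointerException from replaceFirst) and on input that java.util.UUID.fromString cannot parse (IllegalArgumentException), either of which would fail the Spark task. A minimal sketch of a more defensive variant follows, assuming that unparseable values should simply yield no result. The names stringToUUIDOption and CompactUuid are hypothetical and not part of the original answer.

```scala
import java.util.UUID
import scala.util.Try

// Matches exactly 32 hex digits and captures the 8-4-4-4-12 groups.
val CompactUuid =
  """(\p{XDigit}{8})(\p{XDigit}{4})(\p{XDigit}{4})(\p{XDigit}{4})(\p{XDigit}{12})""".r

// Returns the canonical hyphenated form, or None for null or unparseable input.
def stringToUUIDOption(raw: String): Option[String] =
  Option(raw).map(_.trim).flatMap { value =>
    val hyphenated = value match {
      case CompactUuid(a, b, c, d, e) => s"$a-$b-$c-$d-$e" // insert the hyphens
      case other                      => other             // may already be hyphenated
    }
    // UUID.fromString throws IllegalArgumentException on invalid text; Try turns that into None.
    Try(UUID.fromString(hyphenated).toString).toOption
  }
```

Unlike the original pattern, this one requires the whole string to be 32 hex digits, so longer hex strings are rejected rather than partially reformatted.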
Answer 1 (score: 1)
I got it working using the following logic.
Dependency:
<dependency>
<groupId>org.mongodb</groupId>
<artifactId>bson</artifactId>
<version>3.4.2</version>
</dependency>
Function:
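import java.util.UUID
import org.bson.{BsonDocument, BsonDocumentWriter, UuidRepresentation}
import org.bson.codecs.{EncoderContext, UuidCodec}
import org.bson.types.Binary

// Encodes a UUID string as a BSON Binary using the standard UUID representation.
def test(uuids: String): Binary = {
  val uuid = UUID.fromString(uuids)
  // Write the UUID into a throwaway document so the codec produces the binary form
  val holder = new BsonDocument
  val writer = new BsonDocumentWriter(holder)
  writer.writeStartDocument()
  writer.writeName("uuid")
  new UuidCodec(UuidRepresentation.STANDARD).encode(writer, uuid, EncoderContext.builder().build())
  writer.writeEndDocument()
  // Copy the encoded subtype and bytes into a Binary value
  val bsonBinary = holder.getBinary("uuid")
  new Binary(bsonBinary.getType, bsonBinary.getData)
}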
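For reference, the STANDARD representation used above stores a UUID as 16 bytes in big-endian order, most significant bits first, under BSON binary subtype 4. A minimal sketch of that byte layout using only the JDK, in case it helps to inspect or verify the stored bytes; the helper names uuidToBytes and bytesToUuid are hypothetical, and this sketch does not use the bson library.

```scala
import java.nio.ByteBuffer
import java.util.UUID

// Lays out the UUID as 16 big-endian bytes: most significant long first, then least significant.
def uuidToBytes(uuid: UUID): Array[Byte] = {
  val buffer = ByteBuffer.allocate(16) // ByteBuffer is big-endian by default
  buffer.putLong(uuid.getMostSignificantBits)
  buffer.putLong(uuid.getLeastSignificantBits)
  buffer.array()
}

// Reverses uuidToBytes; rejects arrays that are not exactly 16 bytes long.
def bytesToUuid(bytes: Array[Byte]): UUID = {
  require(bytes.length == 16, "a UUID needs exactly 16 bytes")
  val buffer = ByteBuffer.wrap(bytes)
  val mostSignificant = buffer.getLong()
  val leastSignificant = buffer.getLong()
  new UUID(mostSignificant, leastSignificant)
}
```

With this layout the hex dump of the bytes reads the same as the UUID string without its hyphens, which makes a stored value easy to compare against the source column.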