Question

I have a DataFrame, yearDF, obtained by reading an RDBMS table as shown below:

val yearDF = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", s"(${query}) as year2017")
  .option("user", devUserName)
  .option("password", devPassword)
  .option("numPartitions", 15)
  .load()

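On our project, my architect says we need to apply a regex pattern to any data read from an RDBMS table before persisting/loading it into a Hive table on HDFS. This is the regex pattern I have to use: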

"regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(%s, E'[\\\\n]+', ' ', 'g' ), E'[\\\\r]+', ' ', 'g' ), E'[\\\\t]+', ' ', 'g' ), E'[\\\\cA]+', ' ', 'g' ), E'[\\\\ca]+', ' ', 'g' ) as %s"

Could anyone let me know how to apply the above regex pattern to all columns of the DataFrame yearDF?

Answer 1

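yearDF.columns returns an Array[String] containing the names of all columns in yearDF.

Map over that array to build one string expression per column. Use the string method .format to substitute the column name for each %s specifier.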

val regexExpr = yearDF.columns.map(c => "regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(%s, E'[\\\\n]+', ' ', 'g' ), E'[\\\\r]+', ' ', 'g' ), E'[\\\\t]+', ' ', 'g' ), E'[\\\\cA]+', ' ', 'g' ), E'[\\\\ca]+', ' ', 'g' ) as %s".format(c ,c))
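To make the map-and-format step easier to see, here is a minimal sketch in plain Scala with no Spark dependency. The template is a deliberately shortened, hypothetical stand-in for the full pattern, and the column names are made up; only the mechanics of filling two %s slots per column are the point.

```scala
object FormatDemo {
  // Hypothetical short template with two %s slots, like the full pattern.
  val template = "trim(%s) as %s"

  // Stand-in for yearDF.columns, which is an Array[String].
  val columns: Array[String] = Array("id", "name")

  // One expression string per column; the name fills both slots.
  val exprs: Array[String] = columns.map(c => template.format(c, c))

  def main(args: Array[String]): Unit = exprs.foreach(println)
}
```

Each element of exprs is an ordinary SQL expression string, which is the form selectExpr expects.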

Pass the generated expressions to selectExpr:

yearDF.selectExpr(regexExpr : _*)
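One caveat: the pattern in the question looks like PostgreSQL syntax (the E'...' escape strings and the trailing 'g' flag), and Spark SQL's regexp_replace may not accept it as written. Spark's version takes a Java regex and replaces every match, so no global flag is needed. Below is a minimal sketch of a Spark-native alternative using the DataFrame API. The names CleanControlChars, patterns and clean are hypothetical. It assumes that both \cA and \ca in the original are meant as Ctrl-A (0x01), and it only touches string columns so that other column types are not implicitly cast to strings; adjust if you really need every column.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, regexp_replace}
import org.apache.spark.sql.types.StringType

object CleanControlChars {
  // Java-regex counterparts of the original patterns (assumption: \cA and \ca
  // both mean Ctrl-A, written here as \x01).
  val patterns: Seq[String] = Seq("[\\n]+", "[\\r]+", "[\\t]+", "[\\x01]+")

  // Chain one regexp_replace per pattern on every string column,
  // keeping the original column name; other columns pass through unchanged.
  def clean(df: DataFrame): DataFrame = {
    val cleaned = df.schema.fields.map { f =>
      if (f.dataType == StringType)
        patterns.foldLeft(col(f.name))((c, p) => regexp_replace(c, p, " ")).as(f.name)
      else
        col(f.name)
    }
    df.select(cleaned: _*)
  }

  def main(args: Array[String]): Unit = {
    // Local session and sample rows for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("clean").getOrCreate()
    import spark.implicits._
    val df = Seq((1, "a\n\nb\tc"), (2, "x\u0001y")).toDF("id", "txt")
    clean(df).show(false)
    spark.stop()
  }
}
```

With the question's DataFrame this would be called as CleanControlChars.clean(yearDF). The patterns are applied one after another, as in the nested original, rather than merged into a single character class, so a newline followed by a tab still becomes two spaces.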
