How to mask columns using Spark 2?

Time: 2017-09-20 21:04:10

Tags: scala apache-spark apache-spark-sql apache-spark-2.0

I have some tables in which I need to mask some columns. The columns to mask vary from table to table, and I read them from the application.conf file.

For example, for the employee table shown below:

+----+------+-----+---------+
| id | name | age | address |
+----+------+-----+---------+
| 1  | abcd | 21  | India   |
+----+------+-----+---------+
| 2  | qazx | 42  | Germany |
+----+------+-----+---------+

If we want to mask the name and age columns, I get those columns in a sequence:

val mask = Seq("name", "age")

The expected values after masking are:

+----+----------------+----------------+---------+
| id | name           | age            | address |
+----+----------------+----------------+---------+
| 1  | *** Masked *** | *** Masked *** | India   |
+----+----------------+----------------+---------+
| 2  | *** Masked *** | *** Masked *** | Germany |
+----+----------------+----------------+---------+

If I have the employee table as a DataFrame, how can I mask these columns?

If I have a payment table like the one below and want to mask the name and salary columns, I get the mask columns in a sequence:

+----+------+--------+----------+
| id | name | salary | tax_code |
+----+------+--------+----------+
| 1  | abcd | 12345  | KT10     |
+----+------+--------+----------+
| 2  | qazx | 98765  | AD12d    |
+----+------+--------+----------+

val mask = Seq("name", "salary")

I tried something like mask.foreach(c => base.withColumn(c, regexp_replace(col(c), "^.*?$", "*** Masked ***" ) ) ), but it did not return anything.

Thanks to @philantrovert, I found a solution. This is what I used:

def maskData(base: DataFrame, maskColumns: Seq[String]) = {
    val maskExpr = base.columns.map { col => if(maskColumns.contains(col)) s"'*** Masked ***' as ${col}" else col }
    base.selectExpr(maskExpr: _*)
}
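
To see what maskData hands to selectExpr, the expression-building step can be run on plain column names without a Spark session. This is only an illustrative sketch: the helper name maskExpressions is hypothetical and not part of the original post, and it mirrors the logic above rather than replacing it.

```scala
// Hypothetical helper mirroring the expression-building step of maskData.
// It takes plain column names, so it runs without a SparkSession.
def maskExpressions(columns: Seq[String], maskColumns: Seq[String]): Seq[String] =
  columns.map { c =>
    // Masked columns become a string literal aliased to the original name;
    // every other column is passed through by name.
    if (maskColumns.contains(c)) s"'*** Masked ***' as $c" else c
  }

val exprs = maskExpressions(Seq("id", "name", "age", "address"), Seq("name", "age"))
exprs.foreach(println)
// id
// '*** Masked ***' as name
// '*** Masked ***' as age
// address
```

Because each masked column is replaced by a string literal, its type in the result is string regardless of the original type (age, for instance, is no longer numeric).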

3 Answers:

Answer 0 (score: 5)

The simplest and fastest way is to use withColumn and simply overwrite the values in the column with "*** Masked ***". Using your small example dataframe:

import spark.implicits._  // needed for toDF

val df = spark.sparkContext.parallelize( Seq (
  (1, "abcd", 12345, "KT10" ),
  (2, "qazx", 98765, "AD12d")
)).toDF("id", "name", "salary", "tax_code")

If you only need to mask a small number of columns with known names, you can do the following:

import org.apache.spark.sql.functions.lit

val mask = Seq("name", "salary")

df.withColumn("name", lit("*** Masked ***"))
  .withColumn("salary", lit("*** Masked ***"))

Otherwise, you need to create a loop:

var df2 = df
for (col <- mask){
  df2 = df2.withColumn(col, lit("*** Masked ***"))
}
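
The same loop can also be written without a mutable variable by folding over the column names. The following is a minimal sketch, not part of the original answer: the local SparkSession, the app name, and the helper name maskColumns are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("mask-example").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "abcd", 12345, "KT10"),
  (2, "qazx", 98765, "AD12d")
).toDF("id", "name", "salary", "tax_code")

// Hypothetical helper: every withColumn call returns a new DataFrame,
// and foldLeft passes that result on to the next column in the list.
def maskColumns(base: DataFrame, mask: Seq[String]): DataFrame =
  mask.foldLeft(base)((acc, c) => acc.withColumn(c, lit("*** Masked ***")))

val masked = maskColumns(df, Seq("name", "salary"))
masked.show()
```

Since DataFrames are immutable, the result of each withColumn call has to be carried forward, which is what the accumulator in foldLeft (or the reassignment of df2 in the loop) does.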

Both approaches give you this result:

+---+--------------+--------------+--------+
| id|          name|        salary|tax_code|
+---+--------------+--------------+--------+
|  1|*** Masked ***|*** Masked ***|    KT10|
|  2|*** Masked ***|*** Masked ***|   AD12d|
+---+--------------+--------------+--------+

Answer 1 (score: 2)

Please check the code below. The key is the udf function.

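import org.apache.spark.sql.functions.udf
import ss.implicits._  // ss is the SparkSession; needed for toDF and $"..."

val df = ss.sparkContext.parallelize( Seq (
  ("c1", "JAN-2017", 49),
  ("c1", "MAR-2017", 83)
)).toDF("city", "month", "sales")
df.show()

// UDF that ignores its input and always returns the mask text
val mask = udf( (s : String) => {
  "*** Masked ***"
})

df.withColumn("city", mask($"city")).show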

Answer 2 (score: 1)

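Your statement

mask.foreach(c => base.withColumn(c, regexp_replace(col(c), "^.*?$", "*** Masked ***" ) ) )

returns Unit, because foreach discards the DataFrame that each withColumn call produces, so nothing is kept. Using map instead would give you a Seq[org.apache.spark.sql.DataFrame] with only one column masked in each element, which does not sound too good either.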

You can use selectExpr and generate the regexp_replace expressions as follows:

base.show
+---+----+-----+-------+
| id|name|  age|address|
+---+----+-----+-------+
|  1|abcd|12345|  KT10 |
|  2|qazx|98765|  AD12d|
+---+----+-----+-------+

val mask = Seq("name", "age")
// Build one SQL expression per column of base; masked columns are wrapped in regexp_replace
val expr = base.columns.map { col =>
  if (mask.contains(col)) s"""regexp_replace(${col}, "^.*", "** Masked **" ) as ${col}"""
  else col
}

This generates regexp_replace expressions for the columns present in the sequence mask:

Array[String] = Array(id, regexp_replace(name, "^.*", "** Masked **" ) as name, regexp_replace(age, "^.*", "** Masked **" ) as age, address)

Now you can use selectExpr on the generated sequence:

base.selectExpr(expr: _*).show

+---+------------+------------+-------+
| id|        name|         age|address|
+---+------------+------------+-------+
|  1|** Masked **|** Masked **|  KT10 |
|  2|** Masked **|** Masked **|  AD12d|
+---+------------+------------+-------+
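
As a variation on the selectExpr approach, the projection can be built from Column objects instead of SQL strings, which avoids quoting the mask text inside an expression. This is a sketch under assumptions rather than the answer's own code: the local SparkSession, the helper name maskWithSelect, and the sample rows (taken from the employee table in the question) are chosen for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("mask-select").getOrCreate()
import spark.implicits._

val base = Seq(
  (1, "abcd", 21, "India"),
  (2, "qazx", 42, "Germany")
).toDF("id", "name", "age", "address")

// Hypothetical helper: one select over all columns, replacing masked ones
// with a literal that keeps the original column name and position.
def maskWithSelect(df: DataFrame, mask: Seq[String]): DataFrame = {
  val projected = df.columns.map { c =>
    if (mask.contains(c)) lit("*** Masked ***").as(c) else col(c)
  }
  df.select(projected: _*)
}

val masked = maskWithSelect(base, Seq("name", "age"))
masked.show()
```

Note that a masked column always ends up with string type, so age is no longer an integer column in the result.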