Pattern matching with regular expressions in Spark DataFrames using spark-shell

Asked: 2016-10-20 12:36:50

Tags: regex scala apache-spark

Suppose we are given a dataset ("DATA") like:

YEAR | FIRST NAME | LAST NAME | VARIABLES
2008 | JOY        | ANDERSON  | spark|python|scala; 45;w/o sports;w datascience
2008 | STEVEN     | JOHNSON   | Spark|R; 90|56
2006 | NIHA       | DIVA      | w/o sports

and we have another dataset ("RESULT") like:

YEAR | FIRST NAME | LAST NAME 
1992 | EMMA       | CENA 
2008 | JOY        | ANDERSON
2008 | STEVEN     | ANDERSON
2006 | NIHA       | DIVA
and so on.

The output should be ("RESULT"):

YEAR | FIRST NAME | LAST NAME | SUBJECT | SCORE | SPORTS | DATASCIENCE
1992 | EMMA       | CENA      |         |       |        |              
2008 | JOY        | ANDERSON  | SPARK   | 45    | FALSE  | TRUE
2008 | JOY        | ANDERSON  | PYTHON  | 45    | FALSE  | TRUE
2008 | JOY        | ANDERSON  | SCALA   | 45    | FALSE  | TRUE
2008 | STEVEN     | ANDERSON  |         |       |        | 
2006 | NIHA       | DIVA      |         |       | FALSE  | 
2008 | STEVEN     | JOHNSON   | SPARK   | 90    |        |
2008 | STEVEN     | JOHNSON   | SPARK   | 56    |        |
2008 | STEVEN     | JOHNSON   | R       | 90    |        |
2008 | STEVEN     | JOHNSON   | R       | 56    |        |
and so on. 

Please note that there are some rows in DATA that are not present in RESULT, and vice versa. For example, "2008,STEVEN,JOHNSON" is present in DATA but not in RESULT, yet its entries should still be added to the RESULT dataset. The columns {SUBJECT, SCORE, SPORTS, DATASCIENCE} come from my own reading of the data: "spark" refers to the SUBJECT, and so on. I hope my query is clear. I am using spark-shell with Spark DataFrames. Note that "Spark" and "spark" should be considered the same.

1 Answer:

Answer 0 (score: 1):

As explained in the comments, you can implement the somewhat tricky logic along the lines of the answer to splitting row in multiple row in spark-shell.

Data:

// spark-shell pre-imports this; the explicit import is only needed outside the shell
import org.apache.spark.sql.functions._

val df = List(
  ("2008","JOY       ","ANDERSON ","spark|python|scala;45;w/o sports;w datascience"),
  ("2008","STEVEN    ","JOHNSON  ","Spark|R;90|56"),
  ("2006","NIHA      ","DIVA     ","w/o sports")
).toDF("YEAR","FIRST NAME","LAST NAME","VARIABLE")

I will only highlight the relatively tricky parts; you can figure out the details yourself. I suggest handling the "w" and "w/o" tags separately. Also, you have to explode the languages in a separate "sql" statement. This gives:

// `sep` is any separator string that cannot occur in the data (the exact value
// is not shown in the original answer; ";;;" here is an assumption):
val sep = ";;;"

val step1 = df.withColumn("backrefReplace",
    split(regexp_replace('VARIABLE,"^([A-Za-z|]+)?;?([\\d\\|]+)?;?(w.*)?$","$1"+sep+"$2"+sep+"$3"),sep))
  .withColumn("letter",explode(split('backrefReplace(0),"\\|")))   // one row per language
  .select('YEAR,$"FIRST NAME",$"LAST NAME",'VARIABLE,'letter,
    explode(split('backrefReplace(1),"\\|")).as("digits"),         // one row per score
    'backrefReplace(2).as("tags")
  )

which gives:

scala> step1.show(false)
+----+----------+---------+----------------------------------------------+------+------+------------------------+
|YEAR|FIRST NAME|LAST NAME|VARIABLE                                      |letter|digits|tags                    |
+----+----------+---------+----------------------------------------------+------+------+------------------------+
|2008|JOY       |ANDERSON |spark|python|scala;45;w/o sports;w datascience|spark |45    |w/o sports;w datascience|
|2008|JOY       |ANDERSON |spark|python|scala;45;w/o sports;w datascience|python|45    |w/o sports;w datascience|
|2008|JOY       |ANDERSON |spark|python|scala;45;w/o sports;w datascience|scala |45    |w/o sports;w datascience|
|2008|STEVEN    |JOHNSON  |Spark|R;90|56                                 |Spark |90    |                        |
|2008|STEVEN    |JOHNSON  |Spark|R;90|56                                 |Spark |56    |                        |
|2008|STEVEN    |JOHNSON  |Spark|R;90|56                                 |R     |90    |                        |
|2008|STEVEN    |JOHNSON  |Spark|R;90|56                                 |R     |56    |                        |
|2006|NIHA      |DIVA     |w/o sports                                    |      |      |w/o sports              |
+----+----------+---------+----------------------------------------------+------+------+------------------------+

Then you have to handle the casing and the tags. For the tags, you can get relatively generic code using explode and pivot, but you have to do some cleaning to match your exact result. Here is an example:

List(("a;b;c")).toDF("str")
  .withColumn("char",explode(split('str,";")))
  .groupBy('str)
  .pivot("char")
  .count
  .show()

+-----+---+---+---+
|  str|  a|  b|  c|
+-----+---+---+---+
|a;b;c|  1|  1|  1|
+-----+---+---+---+

Read more about pivot here.
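
For this particular dataset, since there are only two possible tags, a direct when/contains mapping is simpler than the generic pivot. A minimal cleaning sketch (the name `cleaned` and the tag-parsing logic, which assumes tags always look like "w <topic>" or "w/o <topic>", are assumptions of this sketch, not part of the original answer):

// normalize casing and turn the tags into the boolean columns from the question
val cleaned = step1
  .withColumn("SUBJECT", upper('letter))            // "Spark" and "spark" become "SPARK"
  .withColumn("SCORE", 'digits.cast("int"))
  .withColumn("SPORTS",
    when('tags.contains("w/o sports"), false)
      .when('tags.contains("w sports"), true))      // stays null if no sports tag
  .withColumn("DATASCIENCE",
    when('tags.contains("w/o datascience"), false)
      .when('tags.contains("w datascience"), true))
  .select('YEAR, $"FIRST NAME", $"LAST NAME", 'SUBJECT, 'SCORE, 'SPORTS, 'DATASCIENCE)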

The last step is then simply a left join with the second dataset (the first "RESULT").
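
A sketch of that final join, reusing the `result` and `cleaned` names introduced above:

// keep every RESULT row and attach the parsed DATA columns where they match
val output = result.join(cleaned, Seq("YEAR", "FIRST NAME", "LAST NAME"), "left")
output.show(false)

Since the question also wants rows that appear only in DATA (e.g. "2008,STEVEN,JOHNSON") to end up in the output, passing "full_outer" instead of "left" as the join type would keep those rows as well.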