Adding a new column when joining tables in Spark SQL

Time: 2017-07-15 08:18:43

Tags: apache-spark apache-spark-sql

Dataframe 1: crimedf

scala> crimedf.show(5,false)
+----------+-----------+-----------------------------------------------------------------------------------+-----+-----------------------+---------------------+-----------------+--------------------------+---+
|lat       |lng        |desc                                                                               |zip  |title                  |timeStamp            |twp              |addr                      |e  |
+----------+-----------+-----------------------------------------------------------------------------------+-----+-----------------------+---------------------+-----------------+--------------------------+---+
|40.2978759|-75.5812935|REINDEER CT & DEAD END;  NEW HANOVER; Station 332; 2015-12-10 @ 17:10:52;          |19525|EMS: BACK PAINS/INJURY |2015-12-10 17:40:00.0|NEW HANOVER      |REINDEER CT & DEAD END    |1  |
|40.2580614|-75.2646799|BRIAR PATH & WHITEMARSH LN;  HATFIELD TOWNSHIP; Station 345; 2015-12-10 @ 17:29:21;|19446|EMS: DIABETIC EMERGENCY|2015-12-10 17:40:00.0|HATFIELD TOWNSHIP|BRIAR PATH & WHITEMARSH LN|1  |
|40.1211818|-75.3519752|HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-Station:STA27;                         |19401|Fire: GAS-ODOR/LEAK    |2015-12-10 17:40:00.0|NORRISTOWN       |HAWS AVE                  |1  |
|40.116153 |-75.343513 |AIRY ST & SWEDE ST;  NORRISTOWN; Station 308A; 2015-12-10 @ 16:47:36;              |19401|EMS: CARDIAC EMERGENCY |2015-12-10 17:40:01.0|NORRISTOWN       |AIRY ST & SWEDE ST        |1  |
|40.251492 |-75.6033497|CHERRYWOOD CT & DEAD END;  LOWER POTTSGROVE; Station 329; 2015-12-10 @ 16:56:52;   |null |EMS: DIZZINESS         |2015-12-10 17:40:01.0|LOWER POTTSGROVE |CHERRYWOOD CT & DEAD END  |1  |
+----------+-----------+-----------------------------------------------------------------------------------+-----+-----------------------+---------------------+-----------------+--------------------------+---+
only showing top 5 rows

crimedf.registerTempTable("crimedf")

Dataframe 2: zipcode

scala> zipcode.show(5)
+---+----------+-----+---------+----------+--------+---+
|zip|      city|state| latitude| longitude|timezone|dst|
+---+----------+-----+---------+----------+--------+---+
|210|Portsmouth|   NH|43.005895|-71.013202|      -5|  1|
|211|Portsmouth|   NH|43.005895|-71.013202|      -5|  1|
|212|Portsmouth|   NH|43.005895|-71.013202|      -5|  1|
|213|Portsmouth|   NH|43.005895|-71.013202|      -5|  1|
|214|Portsmouth|   NH|43.005895|-71.013202|      -5|  1|
+---+----------+-----+---------+----------+--------+---+

zipcode.registerTempTable("zipcode")

My requirements are:

  1. Create a new column "problem" by extracting the substring before ":" from the "title" column of table "crimedf".

  2. Join the 2 tables, group by the columns "state" and "problem", and generate counts.

When I generate a new table from the first table and join it with the second table, I get the desired output.

scala> val newcrimedf = sqlContext.sql("select substring_index(title,':',1) as problem, zip from crimedf")
newcrimedf: org.apache.spark.sql.DataFrame = [problem: string, zip: int]

scala> newcrimedf.show(2)
+-------+-----+
|problem|  zip|
+-------+-----+
|    EMS|19525|
|    EMS|19446|
+-------+-----+

newcrimedf.registerTempTable("newcrimedf")

sqlContext.sql("select z.state, n.problem, count(*) as count 
from newcrimedf n 
JOIN zipcode z 
ON n.zip = z.zip 
GROUP BY z.state,n.problem 
ORDER BY count DESC").show
+-----+-------+-----+                                                           
|state|problem|count|
+-----+-------+-----+
|   PA|    EMS|44326|
|   PA|Traffic|29297|
|   PA|   Fire|13012|
|   AL|Traffic|    1|
|   TX|    EMS|    1|
+-----+-------+-----+

How can I generate the same output from the original first table ("crimedf") without creating the second table ("newcrimedf")?

How do I add the new column while joining? Please help.

I tried to do it, but got it wrong. Here is my attempt:

sqlContext.sql("select z.state, c.problem, count(*) as count from 
(select zip, substring(title,':',1) problem from crimedf) c 
JOIN zipcode z ON c.zip = z.zip 
GROUP BY z.state,c.problem ORDER BY count desc").show
+-----+-------+-----+                                                           
|state|problem|count|
+-----+-------+-----+
|   PA|   null|86635|
|   TX|   null|    1|
|   AL|   null|    1|
+-----+-------+-----+

1 Answer:

Answer 0 (score: 0)

  "Create a new column 'problem' by extracting the substring before ':' from the 'title' column of table 'crimedf'."

This can be achieved with the withColumn API and the simple split function (see the code below).

  "Join the 2 tables, group by the columns 'state' and 'problem', and generate counts."

This can be achieved with join, groupBy, and the count aggregation (see the code below).

The following code should work for you:

crimedf.select("zip", "title")                    //selecting needed columns from crimedf
  .withColumn("problem", split($"title", ":")(0)) //generating problem column by splitting title column
  .join(zipcode, Seq("zip"))                      // joining with zipcode dataframe with zip column
  .groupBy("state", "problem")                   //grouping by state and problem
  .agg(count("state"))                           //counting the grouped data
  .show(false)
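
The same result can also be written as a single DataFrame expression with selectExpr, which accepts the substring_index function used in the SQL version. A minimal sketch, assuming the same crimedf and zipcode dataframes and the implicits import shown above:

crimedf
  .selectExpr("zip", "substring_index(title, ':', 1) as problem") // derive problem inline: everything before the first ":"
  .join(zipcode, Seq("zip"))                                      // join on the zip column
  .groupBy("state", "problem")                                    // group by state and problem
  .count()                                                        // count rows per group, producing a "count" column
  .orderBy($"count".desc)                                         // sort by the count, descending
  .show(false)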

Edited:

Your SQL query works perfectly and gives the same result as the API used above. You just forgot the _index in substring: it should be substring_index.
sqlContext.sql("""select z.state, c.problem, count(*) as count from
              (select zip, substring_index(title,':',1) as problem from crimedf) c
                 JOIN zipcode z ON c.zip = z.zip
                 GROUP BY z.state,c.problem ORDER BY count desc""").show(false)
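
The null values in your attempt come from substring itself: substring(str, pos, len) expects an integer position, so the ':' argument is cast to int, becomes null, and nulls out the whole result (which matches the output you saw), whereas substring_index(str, delim, count) takes the delimiter string directly. A quick demo of the difference, in Spark versions where such implicit casts are lenient; the column aliases here are my own:

sqlContext.sql("""select substring('EMS: BACK PAINS/INJURY', ':', 1)       as with_substring,
                         substring_index('EMS: BACK PAINS/INJURY', ':', 1) as with_substring_index""").show(false)

Here with_substring comes back null, while with_substring_index returns EMS.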