Dataframe 1: crimedf
scala> crimedf.show(5,false)
+----------+-----------+-----------------------------------------------------------------------------------+-----+-----------------------+---------------------+-----------------+--------------------------+---+
|lat |lng |desc |zip |title |timeStamp |twp |addr |e |
+----------+-----------+-----------------------------------------------------------------------------------+-----+-----------------------+---------------------+-----------------+--------------------------+---+
|40.2978759|-75.5812935|REINDEER CT & DEAD END; NEW HANOVER; Station 332; 2015-12-10 @ 17:10:52; |19525|EMS: BACK PAINS/INJURY |2015-12-10 17:40:00.0|NEW HANOVER |REINDEER CT & DEAD END |1 |
|40.2580614|-75.2646799|BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP; Station 345; 2015-12-10 @ 17:29:21;|19446|EMS: DIABETIC EMERGENCY|2015-12-10 17:40:00.0|HATFIELD TOWNSHIP|BRIAR PATH & WHITEMARSH LN|1 |
|40.1211818|-75.3519752|HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-Station:STA27; |19401|Fire: GAS-ODOR/LEAK |2015-12-10 17:40:00.0|NORRISTOWN |HAWS AVE |1 |
|40.116153 |-75.343513 |AIRY ST & SWEDE ST; NORRISTOWN; Station 308A; 2015-12-10 @ 16:47:36; |19401|EMS: CARDIAC EMERGENCY |2015-12-10 17:40:01.0|NORRISTOWN |AIRY ST & SWEDE ST |1 |
|40.251492 |-75.6033497|CHERRYWOOD CT & DEAD END; LOWER POTTSGROVE; Station 329; 2015-12-10 @ 16:56:52; |null |EMS: DIZZINESS |2015-12-10 17:40:01.0|LOWER POTTSGROVE |CHERRYWOOD CT & DEAD END |1 |
+----------+-----------+-----------------------------------------------------------------------------------+-----+-----------------------+---------------------+-----------------+--------------------------+---+
only showing top 5 rows
crimedf.registerTempTable("crimedf")
Dataframe 2: zipcode
scala> zipcode.show(5)
+---+----------+-----+---------+----------+--------+---+
|zip| city|state| latitude| longitude|timezone|dst|
+---+----------+-----+---------+----------+--------+---+
|210|Portsmouth| NH|43.005895|-71.013202| -5| 1|
|211|Portsmouth| NH|43.005895|-71.013202| -5| 1|
|212|Portsmouth| NH|43.005895|-71.013202| -5| 1|
|213|Portsmouth| NH|43.005895|-71.013202| -5| 1|
|214|Portsmouth| NH|43.005895|-71.013202| -5| 1|
+---+----------+-----+---------+----------+--------+---+
zipcode.registerTempTable("zipcode")
My requirements are:

1. Create a new column "problem" by extracting the substring before ":" from the column "title" of table "crimedf".
2. Join the 2 tables, group by the columns "state" and "problem", and generate counts.

When I generate a new table from the first table and join it with the second table, I get the desired output.
scala> val newcrimedf = sqlContext.sql("select substring_index(title,':',1) as problem, zip from crimedf")
newcrimedf: org.apache.spark.sql.DataFrame = [problem: string, zip: int]
scala> newcrimedf.show(2)
+-------+-----+
|problem| zip|
+-------+-----+
| EMS|19525|
| EMS|19446|
+-------+-----+
newcrimedf.registerTempTable("newcrimedf")
sqlContext.sql("""select z.state, n.problem, count(*) as count
from newcrimedf n
JOIN zipcode z
ON n.zip = z.zip
GROUP BY z.state, n.problem
ORDER BY count DESC""").show
+-----+-------+-----+
|state|problem|count|
+-----+-------+-----+
| PA| EMS|44326|
| PA|Traffic|29297|
| PA| Fire|13012|
| AL|Traffic| 1|
| TX| EMS| 1|
+-----+-------+-----+
How can I generate the same output from the original first table ("crimedf") without creating the second table "newcrimedf"?
How can I add the new column while joining? Please help.
I tried this but got it wrong. Here is my attempt:
sqlContext.sql("""select z.state, c.problem, count(*) as count from
(select zip, substring(title,':',1) problem from crimedf) c
JOIN zipcode z ON c.zip = z.zip
GROUP BY z.state, c.problem ORDER BY count desc""").show
+-----+-------+-----+
|state|problem|count|
+-----+-------+-----+
| PA| null|86635|
| TX| null| 1|
| AL| null| 1|
+-----+-------+-----+
Answer 0 (score: 0)
To create the new column "problem" by extracting the substring before ":" from the column "title" of table "crimedf", you can use the withColumn api with the simple split function (see the code below).
To join the 2 tables, group by the columns "state" and "problem", and generate counts, you can use join, groupBy, and the count aggregation (see the code below).
The following code should work for you:
crimedf.select("zip", "title") //selecting needed columns from crimedf
.withColumn("problem", split($"title", ":")(0)) //generating problem column by splitting title column
.join(zipcode, Seq("zip")) // joining with zipcode dataframe with zip column
.groupBy("state", "problem") //grouping by state and problem
.agg(count("state")) //counting the grouped data
.show(false)
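As a plain-Scala sanity check of the split step used above, element 0 of the split is the text before the first colon (the sample title is taken from the crimedf output shown earlier):

```scala
// Plain-Scala mirror of what split($"title", ":")(0) computes per row:
// splitting on ":" and taking element 0 gives the text before the first colon.
val title = "EMS: BACK PAINS/INJURY"
val problem = title.split(":")(0)
println(problem) // prints "EMS"
```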
Edited:
Your SQL query works perfectly and gives the same result as the api used above. You just forgot the _index in substring (it should be substring_index):
sqlContext.sql("""select z.state, c.problem, count(*) as count from
(select zip, substring_index(title,':',1) as problem from crimedf) c
JOIN zipcode z ON c.zip = z.zip
GROUP BY z.state,c.problem ORDER BY count desc""").show(false)
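For intuition on why the first attempt returned null: substring expects an integer start position, so passing ':' most likely casts to null, whereas substring_index(title, ':', 1) returns everything before the first ':'. A minimal plain-Scala sketch of that substring_index behaviour (beforeFirst is a hypothetical helper for illustration, not a Spark API):

```scala
// Hypothetical helper mirroring Spark SQL's substring_index(str, delim, 1):
// return the text before the first occurrence of delim,
// or the whole string when delim does not occur.
def beforeFirst(str: String, delim: String): String = {
  val i = str.indexOf(delim)
  if (i < 0) str else str.substring(0, i)
}

println(beforeFirst("Fire: GAS-ODOR/LEAK", ":")) // prints "Fire"
println(beforeFirst("NO DELIMITER HERE", ":"))   // prints the input unchanged
```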