I am developing with Spark SQL on Spark (2.0), reading a CSV through the Java API.
The CSV file contains double-quoted, delimiter-separated columns, for example: "Express Air,Delivery Truck"
Code that reads the CSV and returns a Dataset:
Dataset<Row> df = spark.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load(filename);
Result:
+-----+-----------------------+--------------------------+
|Year |State                  |Ship Mode                 |...
+-----+-----------------------+--------------------------+
|2012 |New York/California    |Express Air/Delivery Truck|...
|2013 |Nevada/Texas           |Delivery Truck            |...
|2014 |North Carolina/Kentucky|Regular Air/Delivery Truck|...
+-----+-----------------------+--------------------------+
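(Aside: Spark 2.x also ships a built-in CSV source, so the same read should work without the external Databricks package; a minimal equivalent, assuming the same spark session and filename:)

Dataset<Row> df = spark.read()
    .option("inferSchema", "true")
    .option("header", "true")
    .csv(filename);  // built-in CSV reader since Spark 2.0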
However, I want to split State and Ship Mode into a single Mode column, return it as a Dataset, and preserve the original pairing order, e.g. {New York,Express Air} {California,Delivery Truck}:
+-----+--------------------------+
|Year |Mode                      |
+-----+--------------------------+
|2012 |New York,Express Air      |
|2012 |California,Delivery Truck |
|2013 |Nevada,Delivery Truck     |
|2013 |Texas,Delivery Truck      |
|2014 |North Carolina,Regular Air|
|2014 |Kentucky,Delivery Truck   |
+-----+--------------------------+
Is there any way to do this with Spark in Java?
Answer 0 (score: 0)
Here is a Spark SQL approach:
df.createOrReplaceTempView("tab")

// Explode each slash-separated column, number the rows,
// then join the two exploded sets on year and row number.
val q = """
with m as (
  select year, explode(split(State, "/")) as State,
         row_number() over(order by year) as rn
  from tab
), s as (
  select year, explode(split(`Ship Mode`, "/")) as Mode,
         row_number() over(order by year) as rn
  from tab
)
select m.year, m.State, s.Mode
from m
join s
  on m.year = s.year and m.rn = s.rn
"""
spark.sql(q).show
Result:
scala> spark.sql(q).show
+----+--------------+--------------+
|year|         State|          Mode|
+----+--------------+--------------+
|2012|      New York|   Express Air|
|2012|    California|Delivery Truck|
|2013|        Nevada|Delivery Truck|
|2014|North Carolina|Delivery Truck|
+----+--------------+--------------+
If needed, you can easily concatenate the columns:
val q = """
with m as (
select year, explode(split(State, "/")) as State, row_number() over(order by year) as rn from tab
), s as (
select year, explode(split(`Ship Mode`, "/")) as Mode, row_number() over(order by year) as rn from tab
)
select m.year, concat(m.State, ',', s.Mode) as Mode
from m
join s
on m.year = s.year and m.rn = s.rn
"""
Result:
scala> spark.sql(q).show(false)
+----+-----------------------------+
|year|Mode                         |
+----+-----------------------------+
|2012|New York,Express Air         |
|2012|California,Delivery Truck    |
|2013|Nevada,Delivery Truck        |
|2014|North Carolina,Delivery Truck|
+----+-----------------------------+
PS: I used Scala, but it should be much the same in Java...
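For completeness, a minimal Java sketch of the concat variant (untested; assumes the same df and spark session as in the question):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

df.createOrReplaceTempView("tab");

// Same CTE query as above; single quotes for the SQL string literals.
String q = "with m as ("
    + "  select year, explode(split(State, '/')) as State,"
    + "         row_number() over(order by year) as rn from tab"
    + "), s as ("
    + "  select year, explode(split(`Ship Mode`, '/')) as Mode,"
    + "         row_number() over(order by year) as rn from tab"
    + ") "
    + "select m.year, concat(m.State, ',', s.Mode) as Mode "
    + "from m join s on m.year = s.year and m.rn = s.rn";

spark.sql(q).show(false);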
Answer 1 (score: 0)
Yes, it can be done in a few steps.
Step 1: ds1 <- year | concat(part 1 of State, part 1 of Mode)
Step 2: ds2 <- year | concat(part 2 of State, part 2 of Mode)
Step 3: ds3 <- ds1 union all ds2, ordered by year
That should do the job; a rough sketch follows.
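A minimal sketch of those steps with the Java Dataset API (assumes the df from the question; note that getItem(1) returns null when a column holds only one part, as in the 2013 row, so real code would need to handle that):

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Step 1: pair the first State part with the first Ship Mode part.
Dataset<Row> ds1 = df.select(col("Year"),
    concat(split(col("State"), "/").getItem(0), lit(","),
           split(col("Ship Mode"), "/").getItem(0)).as("Mode"));

// Step 2: pair the second parts of each column.
Dataset<Row> ds2 = df.select(col("Year"),
    concat(split(col("State"), "/").getItem(1), lit(","),
           split(col("Ship Mode"), "/").getItem(1)).as("Mode"));

// Step 3: union the two datasets and order by year.
Dataset<Row> ds3 = ds1.union(ds2).orderBy("Year");
ds3.show(false);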