Split string columns of a Dataset<Row> and return them as a Dataset<Row>

Date: 2017-11-03 09:02:44

Tags: java sql apache-spark dataset apache-spark-sql

I am working with Spark SQL (Spark 2.0) and reading a CSV file using the Java API.

The CSV file has double-quoted/delimited columns. For example: "Express Air,Delivery Truck"

The code that reads the CSV and returns a Dataset:

Dataset<Row> df = spark.read()
                .format("com.databricks.spark.csv") // external CSV source
                .option("inferSchema", "true")      // infer column types
                .option("header", "true")           // first line is the header
                .load(filename);
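
(As an aside, Spark 2.x also ships a built-in CSV source; a minimal equivalent sketch, assuming the same `spark` session and `filename`:)

Dataset<Row> df = spark.read()
                .option("inferSchema", "true")
                .option("header", "true")
                .csv(filename);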

The result:

+-----+-----------------------+--------------------------+
|Year |       State           |                Ship Mode |...
+-----+-----------------------+--------------------------+
|2012 |New York/California    |Express Air/Delivery Truck|...
|2013 |Nevada/Texas           |Delivery Truck            |...
|2014 |North Carolina/Kentucky|Regular Air/Delivery Truck|...
+-----+-----------------------+--------------------------+

However, I want to split the State and Ship Mode columns, pair their parts into a single Mode column, and return the result as a Dataset while preserving the pairwise order, e.g. {New York, Express Air} {California, Delivery Truck}:

+-----+--------------------------+
|Year |      Mode                |   
+-----+--------------------------+
|2012 |New York,Express Air      |
|2012 |California,Delivery Truck |
|2013 |Nevada,Delivery Truck     |
|2013 |Texas,Delivery Truck      |
|2014 |North Carolina,Regular Air|
|2014 |Kentucky,Delivery Truck   |
+-----+--------------------------+

Is there any way to do this with Spark in Java?

2 Answers:

Answer 0 (score: 0):

Here is a Spark SQL approach:

df.createOrReplaceTempView("tab")

val q = """
with m as (
  select year, explode(split(State, "/")) as State, row_number() over(order by year) as rn from tab
), s as (
  select year, explode(split(`Ship Mode`, "/")) as Mode, row_number() over(order by year) as rn from tab
)
select m.year, m.State, s.Mode
from m
join s
  on m.year = s.year and m.rn = s.rn
"""

spark.sql(q).show

The result:

scala> spark.sql(q).show
+----+--------------+--------------+
|year|         State|          Mode|
+----+--------------+--------------+
|2012|      New York|   Express Air|
|2012|    California|Delivery Truck|
|2013|        Nevada|Delivery Truck|
|2014|North Carolina|Delivery Truck|
+----+--------------+--------------+

You can easily concatenate the columns if needed:

val q = """
with m as (
  select year, explode(split(State, "/")) as State, row_number() over(order by year) as rn from tab
), s as (
  select year, explode(split(`Ship Mode`, "/")) as Mode, row_number() over(order by year) as rn from tab
)
select m.year, concat(m.State, ',', s.Mode) as Mode
from m
join s
  on m.year = s.year and m.rn = s.rn
"""

The result:

scala> spark.sql(q).show(false)
+----+-----------------------------+
|year|Mode                         |
+----+-----------------------------+
|2012|New York,Express Air         |
|2012|California,Delivery Truck    |
|2013|Nevada,Delivery Truck        |
|2014|North Carolina,Delivery Truck|
+----+-----------------------------+

P.S. I used Scala, but it should be much the same in Java...
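
For example, a minimal Java sketch of the concatenating variant, assuming `spark` is the active SparkSession and `df` is the Dataset loaded in the question:

// Register the Dataset as a temp view so it can be queried with SQL.
df.createOrReplaceTempView("tab");

String q = "with m as ("
         + "  select year, explode(split(State, '/')) as State, "
         + "         row_number() over (order by year) as rn from tab"
         + "), s as ("
         + "  select year, explode(split(`Ship Mode`, '/')) as Mode, "
         + "         row_number() over (order by year) as rn from tab"
         + ") "
         + "select m.year, concat(m.State, ',', s.Mode) as Mode "
         + "from m join s on m.year = s.year and m.rn = s.rn";

spark.sql(q).show(false);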

Answer 1 (score: 0):

Yes, it can be done in a few steps.

Step 1: ds1 <- year | concat(part 1 of State, part 1 of Mode)

Step 2: ds2 <- year | concat(part 2 of State, part 2 of Mode)

Step 3: ds3 <- ds1 union all ds2, ordered by year

That should do the job.
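
A minimal Java sketch of those steps, assuming the column names from the question and at most two "/"-separated parts per cell; note that getItem(1) is null when a cell has only one part, so a row like 2013's Ship Mode would need extra handling (e.g. coalesce):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.*;

// Step 1: pair the first part of State with the first part of Ship Mode.
Dataset<Row> ds1 = df.select(col("Year"),
        concat(split(col("State"), "/").getItem(0), lit(","),
               split(col("Ship Mode"), "/").getItem(0)).as("Mode"));

// Step 2: the same for the second parts.
Dataset<Row> ds2 = df.select(col("Year"),
        concat(split(col("State"), "/").getItem(1), lit(","),
               split(col("Ship Mode"), "/").getItem(1)).as("Mode"));

// Step 3: union (UNION ALL semantics in Spark), drop empty pairs, order by year.
Dataset<Row> ds3 = ds1.union(ds2)
        .filter(col("Mode").isNotNull())
        .orderBy("Year");

ds3.show(false);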