How to use withColumn with a condition for each row in a Scala / Spark DataFrame

Time: 2018-04-08 17:25:26

Tags: scala apache-spark apache-spark-sql

I have a DataFrame in the format below:

+-----------------+-------------------------+-----------+-----------------------------+---------------------------+----------------------------------+--------------------------------------+--------------------------------+
|DataPartition    |TimeStamp                |FFAction|!||IdentifierValue_effectiveFrom|IdentifierValue_effectiveTo|IdentifierValue_identifierEntityId|IdentifierValue_identifierEntityTypeId|IdentifierValue_identifierTypeId|
+-----------------+-------------------------+-----------+-----------------------------+---------------------------+----------------------------------+--------------------------------------+--------------------------------+
|SelfSourcedPublic|2018-03-05T11:54:18+00:00|I|!|       |1900-01-01T00:00:00+00:00    |9999-12-31T00:00:00+00:00  |4295903126                        |404010                                |320150                          |
+-----------------+-------------------------+-----------+-----------------------------+---------------------------+----------------------------------+--------------------------------------+--------------------------------+

I want to add an extra column, Partition, based on a condition on the column IdentifierValue_identifierEntityTypeId:

  if IdentifierValue_identifierEntityTypeId = 1001371402 then Partition = Repno2FundamentalSeries
  else if IdentifierValue_identifierEntityTypeId = 404010 then Partition = Repno2Organization

This is what I tried:

val temp = temp1.withColumn("Partition", when($"IdentifierValue_identifierEntityTypeId" === "404010", 0).otherwise("Repno2FundamentalSeries"))
temp.show(false)

I get the output below, but the Partition value is 0:

+-----------------+-------------------------+-----------+-----------------------------+---------------------------+----------------------------------+--------------------------------------+--------------------------------+---------+
|DataPartition    |TimeStamp                |FFAction|!||IdentifierValue_effectiveFrom|IdentifierValue_effectiveTo|IdentifierValue_identifierEntityId|IdentifierValue_identifierEntityTypeId|IdentifierValue_identifierTypeId|Partition|
+-----------------+-------------------------+-----------+-----------------------------+---------------------------+----------------------------------+--------------------------------------+--------------------------------+---------+
|SelfSourcedPublic|2018-03-05T11:54:18+00:00|I|!|       |1900-01-01T00:00:00+00:00    |9999-12-31T00:00:00+00:00  |4295903126                        |404010                                |320150                          |0        |
+-----------------+-------------------------+-----------+-----------------------------+---------------------------+----------------------------------+--------------------------------------+--------------------------------+---------+

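I am new to Scala, hence the basic question.

How do I write when and otherwise for multiple conditions on a column? The code below does not work for me. It fails with this error:

  Exception in thread "main" java.lang.IllegalArgumentException: otherwise() can only be applied once on a Column previously generated by when()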

val dataMain = dataMain1.withColumn(
      "Partition",
      when($"RelationObjectId_relatedObjectType" === "EDInstrument" && $"RelationObjectId_relatedObjectType" === "Fundamental", "Instrument2Fundamental")
        .otherwise(when($"RelationObjectId_relatedObjectType" === "EDInstrument" && $"RelationObjectId_relatedObjectType" === "FundamentalSeries", "Instrument2FundamentalSeries"))
        .otherwise(when($"RelationObjectId_relatedObjectType" === "Organization" && $"RelationObjectId_relatedObjectType" === "Fundamental", "Organization2Fundamental"))
        .otherwise(when($"RelationObjectId_relatedObjectType" === "Organization" && $"RelationObjectId_relatedObjectType" === "FundamentalSeries", "Organization2FundamentalSeries"))
        )

2 Answers:

Answer 0 (score: 2)

Based on the condition you gave, you should change the when condition as shown below.

  if IdentifierValue_identifierEntityTypeId = 1001371402 then Partition = Repno2FundamentalSeries
  else if IdentifierValue_identifierEntityTypeId = 404010 then Partition = Repno2Organization


df1.withColumn("Partition",
  when($"IdentifierValue_identifierEntityTypeId" === "1001371402", "Repno2FundamentalSeries")
    .otherwise("Repno2Organization")
)

Output:

+-----------------+-------------------------+-----------+-----------------------------+---------------------------+----------------------------------+--------------------------------------+--------------------------------+-----------------------+
|DataPartition    |TimeStamp                |FFAction|!||IdentifierValue_effectiveFrom|IdentifierValue_effectiveTo|IdentifierValue_identifierEntityId|IdentifierValue_identifierEntityTypeId|IdentifierValue_identifierTypeId|Partition              |
+-----------------+-------------------------+-----------+-----------------------------+---------------------------+----------------------------------+--------------------------------------+--------------------------------+-----------------------+
|SelfSourcedPublic|2018-03-05T11:54:18+00:00|I||!       |1900-01-01T00:00:00+00:00    |9999-12-31T00:00:00+00:00  |4295903126                        |404010                                |320150                          |Repno2FundamentalSeries|
+-----------------+-------------------------+-----------+-----------------------------+---------------------------+----------------------------------+--------------------------------------+--------------------------------+-----------------------+

Edit

For multiple conditions, when can be nested: each further when goes inside the otherwise of the previous one, so that otherwise is applied only once per when.
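Since the error in the question comes from calling otherwise more than once on the same when, here is a minimal sketch of two ways to express several conditions. It is not the answerer's original code. It assumes a toy DataFrame holding only the column from the question, and the "Unknown" default for unmatched rows is an assumption made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("when-otherwise-sketch")
  .getOrCreate()
import spark.implicits._

// Toy data: the two type ids from the question plus one unmatched value (assumed).
val df = Seq("1001371402", "404010", "999")
  .toDF("IdentifierValue_identifierEntityTypeId")

val typeId = $"IdentifierValue_identifierEntityTypeId"

// Chained form: any number of when(...) calls, closed by a single otherwise(...).
val chained = df.withColumn("Partition",
  when(typeId === "1001371402", "Repno2FundamentalSeries")
    .when(typeId === "404010", "Repno2Organization")
    .otherwise("Unknown")) // "Unknown" is an assumed default, not from the question

// Nested form: the second when(...) lives inside the first otherwise(...).
val nested = df.withColumn("Partition",
  when(typeId === "1001371402", "Repno2FundamentalSeries")
    .otherwise(
      when(typeId === "404010", "Repno2Organization")
        .otherwise("Unknown")))

chained.show(false)
nested.show(false)
```

Both forms should produce the same Partition values. The chained form tends to be easier to read once there are more than two branches. If the final otherwise is left out, rows that match no condition get null.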

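Hope this helps.

Answer 1 (score: 0)

Another way to achieve this: instead of withColumn, you can use a SQL CASE WHEN expression through selectExpr.

This may be easier to write if you are familiar with SQL.

For example: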

       val dataMain = dataMain1.selectExpr("*", 
       """CASE WHEN RelationObjectId_relatedObjectType = 'EDInstrument' 
       THEN 'Instrument2Fundamental'
       WHEN cond2 
       THEN value2
       ELSE defaultValue end AS partition""")
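The snippet above leaves cond2, value2 and defaultValue as placeholders. As a concrete sketch of the same idea, the version below fills them in with the conditions from the question. The toy DataFrame and the 'Unknown' default are assumptions for illustration, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("case-when-sketch")
  .getOrCreate()
import spark.implicits._

// Toy data: the two type ids from the question plus one unmatched value (assumed).
val df = Seq("1001371402", "404010", "999")
  .toDF("IdentifierValue_identifierEntityTypeId")

// "*" keeps all existing columns; the CASE expression adds the new one.
val withPartition = df.selectExpr("*",
  """CASE WHEN IdentifierValue_identifierEntityTypeId = '1001371402' THEN 'Repno2FundamentalSeries'
    |     WHEN IdentifierValue_identifierEntityTypeId = '404010' THEN 'Repno2Organization'
    |     ELSE 'Unknown'
    |END AS Partition""".stripMargin) // 'Unknown' is an assumed default

withPartition.show(false)
```

A CASE expression accepts any number of WHEN branches and at most one ELSE, so the "otherwise() can only be applied once" error from the question cannot arise in this form.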