I am very new to this. My question: given the case class case class testclass(date_key: String , amount: Int, type: String, condition1: String, condition2: String) in a DataFrame df, I want to sum the amount per type: String, but only for the rows where condition1 = condition2. I am trying to define a function for this, but how should I go about it? Many thanks!
def sumAmount(t: testclass): Int = {
  if (condition1 == condition2) {
  } else {
    "na"
  }
}
Answer 0 (score: 2)
I am assuming that you already have a dataframe created from the case class. For testing purposes, I created a test case class testclass(date_key: String , amount: Int, types: String, condition1: String, condition2: String) (note that the type field is renamed to types, since type is a reserved keyword in Scala) and a dataframe as
import sqlContext.implicits._
val df = Seq(
testclass("2015-01-01", 332, "types", "condition1", "condition1"),
testclass("2015-01-01", 332, "types", "condition1", "condition1"),
testclass("2015-01-01", 332, "types", "condition1", "condition2"),
testclass("2015-01-01", 332, "types2", "condition1", "condition1"),
testclass("2015-01-01", 332, "types2", "condition1", "condition1"),
testclass("2015-01-01", 332, "types2", "condition1", "condition1"),
testclass("2015-01-01", 332, "types2", "condition1", "condition2")
).toDF
which should give you

+----------+------+------+----------+----------+
|date_key |amount|types |condition1|condition2|
+----------+------+------+----------+----------+
|2015-01-01|332 |types |condition1|condition1|
|2015-01-01|332 |types |condition1|condition1|
|2015-01-01|332 |types |condition1|condition2|
|2015-01-01|332 |types2|condition1|condition1|
|2015-01-01|332 |types2|condition1|condition1|
|2015-01-01|332 |types2|condition1|condition1|
|2015-01-01|332 |types2|condition1|condition2|
+----------+------+------+----------+----------+
Now, you want to groupBy the types column and sum the amount where condition1 = condition2. For that, you simply filter the rows where condition1 = condition2, then groupBy types and apply a sum aggregation:
import org.apache.spark.sql.functions.sum

df.filter($"condition1" === $"condition2")
  .groupBy("types")
  .agg(sum("amount").as("sum"))
  .show(false)
You should then have the desired result:
+------+---+
|types |sum|
+------+---+
|types |664|
|types2|996|
+------+---+
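As a variant, the filter step can be folded into the aggregation with a conditional sum: rows where the conditions differ contribute null, which sum ignores. A minimal sketch against the same df:

// Conditional sum: only rows with matching conditions contribute to the total.
import org.apache.spark.sql.functions.{sum, when}

df.groupBy("types")
  .agg(sum(when($"condition1" === $"condition2", $"amount")).as("sum"))
  .show(false)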
Updated
If you want to use a dataSet instead of a dataframe, you can simply use .toDS instead of .toDF:
scala> import sqlContext.implicits._
import sqlContext.implicits._
scala> case class testclass(date_key: String , amount: Int, types: String, condition1: String, condition2: String)
defined class testclass
scala> val ds = Seq(
| testclass("2015-01-01", 332, "types", "condition1", "condition1"),
| testclass("2015-01-01", 332, "types", "condition1", "condition1"),
| testclass("2015-01-01", 332, "types", "condition1", "condition2"),
| testclass("2015-01-01", 332, "types2", "condition1", "condition1"),
| testclass("2015-01-01", 332, "types2", "condition1", "condition2")
| ).toDS
ds: org.apache.spark.sql.Dataset[testclass] = [date_key: string, amount: int ... 3 more fields]
You can see that it is a dataset, not a dataframe. The rest of the steps are the same as explained above.
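If you prefer to stay in the typed API instead of falling back to untyped columns, a minimal sketch of the same aggregation with groupByKey/mapGroups (assuming the ds defined above and the implicits import) could look like this:

// Typed alternative: keep everything in terms of the case class.
ds.filter(t => t.condition1 == t.condition2)
  .groupByKey(_.types)
  .mapGroups((types, rows) => (types, rows.map(_.amount).sum))
  .toDF("types", "sum")
  .show(false)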
Answer 1 (score: 0)
Filter on data.condition1.equals(data.condition2), then groupBy the data type; that gives you a map with the dataType as the key and the list of case class instances as the value. Example (without Spark):
case class MyData(dataKey: String, amount: Int, dataType: String, condition1: String, condition2: String)

// Keep only the rows whose conditions match, group them by dataType,
// then sum the amounts within each group.
val grouped = List(
  MyData("a", 1000, "type1", "matches1", "matches1"),
  MyData("b", 1000, "type1", "matches1", "matches1"),
  MyData("c", 1000, "type1", "matches1", "matches2"),
  MyData("d", 1000, "type2", "matches1", "matches1")
).filter(data => data.condition1.equals(data.condition2))
  .groupBy(_.dataType)
  .map { case (dataType, values) =>
    dataType -> values.map(_.amount).sum
  }.toMap

grouped("type1") shouldBe 2000
grouped("type2") shouldBe 1000