When I use DataFrame groupBy like this:
df.groupBy(df("age")).agg(Map("id"->"count"))
I only get a DataFrame with the columns "age" and "count(id)", but df has many other columns, such as "name".
In short, I want the result I would get in MySQL from
"select name, age, count(id) from df group by age"
What should I do when using groupBy in Spark?
Answer 0 (score: 23)
Long story short, in general you have to join the aggregated results back to the original table. Spark SQL follows the same pre-SQL:1999 convention as most major databases (PostgreSQL, Oracle, MS SQL Server): it does not allow additional columns in aggregation queries.
Since for aggregations such as count the result for extra columns is not well defined, and behavior differs between systems that support this kind of query, you can include additional columns using an arbitrary aggregate such as first or last.
In some cases you can replace the agg with a select using window functions and a subsequent where, but depending on the context it can be quite expensive.
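A minimal sketch of these options in Scala, assuming df has the columns name, age and id as in the question (the id_count alias is just for illustration):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Option 1: aggregate, then join the result back to the original table.
val counts = df.groupBy("age").agg(count("id").as("id_count"))
val joinedBack = df.select("name", "age").join(counts, Seq("age"))

// Option 2: carry the extra column through an arbitrary aggregate such as first
// (which name is returned per group is not well defined).
val withFirst = df.groupBy("age").agg(first("name").as("name"), count("id").as("id_count"))

// Option 3: a window function keeps every original row and adds the per-age count,
// which can be expensive depending on the data.
val byAge = Window.partitionBy("age")
val windowed = df.withColumn("id_count", count("id").over(byAge)).select("name", "age", "id_count")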
Answer 1 (score: 8)
One way to get all the columns back after a groupBy is to use a join.
feature_group = ['name', 'age']
data_counts = df.groupBy(feature_group).count().alias("counts")
data_joined = df.join(data_counts, feature_group)
data_joined will now contain all the columns, including the count values.
Answer 2 (score: 1)
#solved #working solution
This solution was put together with the help of @Azmisov's comment in this thread; the code example was taken from https://sparkbyexamples.com/spark/using-groupby-on-dataframe/
Problem: in Spark Scala, when using groupBy with max on a DataFrame, the result contains only the columns used in the groupBy and the aggregation. How do you get all the columns as well? In other words, how do you get the non-groupBy columns?
Solution: please see the full example below to get all the columns along with groupBy and max.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._ //{col, lit, when, to_timestamp}
import org.apache.spark.sql.types._
import org.apache.spark.sql.Column
val spark = SparkSession
.builder()
.appName("app-name")
.master("local[*]")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val simpleData = Seq(("James","Sales","NY",90000,34,10000),
("Michael","Sales","NY",86000,56,20000),
("Robert","Sales","CA",81000,30,23000),
("Maria","Finance","CA",90000,24,23000),
("Raman","Finance","CA",99000,40,24000),
("Scott","Finance","NY",83000,36,19000),
("Jen","Finance","NY",79000,53,15000),
("Jeff","Marketing","CA",80000,25,18000),
("Kumar","Marketing","NY",91000,50,21000)
)
val df = simpleData.toDF("employee_name","department","state","salary","age","bonus")
df.show()
Generating df gives the following output:
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
| James| Sales| NY| 90000| 34|10000|
| Michael| Sales| NY| 86000| 56|20000|
| Robert| Sales| CA| 81000| 30|23000|
| Maria| Finance| CA| 90000| 24|23000|
| Raman| Finance| CA| 99000| 40|24000|
| Scott| Finance| NY| 83000| 36|19000|
| Jen| Finance| NY| 79000| 53|15000|
| Jeff| Marketing| CA| 80000| 25|18000|
| Kumar| Marketing| NY| 91000| 50|21000|
+-------------+----------+-----+------+---+-----+
The code below produces output whose column names are not ideal, but it still works:
val dfwithmax = df.groupBy("department").agg(max("salary"), first("employee_name"), first("state"), first("age"), first("bonus"))
dfwithmax.show()
+----------+-----------+---------------------------+-------------------+-----------------+-------------------+
|department|max(salary)|first(employee_name, false)|first(state, false)|first(age, false)|first(bonus, false)|
+----------+-----------+---------------------------+-------------------+-----------------+-------------------+
| Sales| 90000| James| NY| 34| 10000|
| Finance| 99000| Maria| CA| 24| 23000|
| Marketing| 91000| Jeff| CA| 25| 18000|
+----------+-----------+---------------------------+-------------------+-----------------+-------------------+
To get proper column names, you can alias the aggregated columns as shown below:
val dfwithmax1 = df.groupBy("department").agg(max("salary") as "salary", first("employee_name") as "employee_name", first("state") as "state", first("age") as "age",first("bonus") as "bonus")
dfwithmax1.show()
output:
+----------+------+-------------+-----+---+-----+
|department|salary|employee_name|state|age|bonus|
+----------+------+-------------+-----+---+-----+
| Sales| 90000| James| NY| 34|10000|
| Finance| 99000| Maria| CA| 24|23000|
| Marketing| 91000| Jeff| CA| 25|18000|
+----------+------+-------------+-----+---+-----+
If you also want to change the order of the DataFrame columns, it can be done as follows:
val reOrderedColumnName : Array[String] = Array("employee_name", "department", "state", "salary", "age", "bonus")
val orderedDf = dfwithmax1.select(reOrderedColumnName.head, reOrderedColumnName.tail: _*)
orderedDf.show()
Full code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Column
object test {

  def main(args: Array[String]): Unit = {

    /** spark session object */
    val spark = SparkSession.builder().appName("app-name").master("local[*]")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    import spark.implicits._

    val simpleData = Seq(("James","Sales","NY",90000,34,10000),
      ("Michael","Sales","NY",86000,56,20000),
      ("Robert","Sales","CA",81000,30,23000),
      ("Maria","Finance","CA",90000,24,23000),
      ("Raman","Finance","CA",99000,40,24000),
      ("Scott","Finance","NY",83000,36,19000),
      ("Jen","Finance","NY",79000,53,15000),
      ("Jeff","Marketing","CA",80000,25,18000),
      ("Kumar","Marketing","NY",91000,50,21000)
    )

    val df = simpleData.toDF("employee_name","department","state","salary","age","bonus")
    df.show()

    val dfwithmax = df.groupBy("department").agg(max("salary"), first("employee_name"), first("state"), first("age"), first("bonus"))
    dfwithmax.show()

    val dfwithmax1 = df.groupBy("department").agg(max("salary") as "salary", first("employee_name") as "employee_name", first("state") as "state", first("age") as "age", first("bonus") as "bonus")
    dfwithmax1.show()

    val reOrderedColumnName : Array[String] = Array("employee_name", "department", "state", "salary", "age", "bonus")
    val orderedDf = dfwithmax1.select(reOrderedColumnName.head, reOrderedColumnName.tail: _*)
    orderedDf.show()
  }
}
Full output:
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
| James| Sales| NY| 90000| 34|10000|
| Michael| Sales| NY| 86000| 56|20000|
| Robert| Sales| CA| 81000| 30|23000|
| Maria| Finance| CA| 90000| 24|23000|
| Raman| Finance| CA| 99000| 40|24000|
| Scott| Finance| NY| 83000| 36|19000|
| Jen| Finance| NY| 79000| 53|15000|
| Jeff| Marketing| CA| 80000| 25|18000|
| Kumar| Marketing| NY| 91000| 50|21000|
+-------------+----------+-----+------+---+-----+
+----------+-----------+---------------------------+------------------------+-------------------+-----------------+-------------------+
|department|max(salary)|first(employee_name, false)|first(department, false)|first(state, false)|first(age, false)|first(bonus, false)|
+----------+-----------+---------------------------+------------------------+-------------------+-----------------+-------------------+
| Sales| 90000| James| Sales| NY| 34| 10000|
| Finance| 99000| Maria| Finance| CA| 24| 23000|
| Marketing| 91000| Jeff| Marketing| CA| 25| 18000|
+----------+-----------+---------------------------+------------------------+-------------------+-----------------+-------------------+
+----------+------+-------------+----------+-----+---+-----+
|department|salary|employee_name|department|state|age|bonus|
+----------+------+-------------+----------+-----+---+-----+
| Sales| 90000| James| Sales| NY| 34|10000|
| Finance| 99000| Maria| Finance| CA| 24|23000|
| Marketing| 91000| Jeff| Marketing| CA| 25|18000|
+----------+------+-------------+----------+-----+---+-----+
Exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Reference 'department' is ambiguous, could be: department, department.;
This means the department column is present twice: it is already the groupBy key, and you referenced it again via first("department") as "department". Any later reference to "department" (for example in the reordering select) is then ambiguous.
For example (please check this last):
val dfwithmax1 = df.groupBy("department").agg(max("salary") as "salary", first("employee_name") as "employee_name", first("department") as "department", first("state") as "state", first("age") as "age",first("bonus") as "bonus")
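One way around the ambiguity (a sketch; the department_first alias is just illustrative): either leave out the extra first("department"), since the grouping key is already part of the result, or give the aggregated copy a distinct alias:
val dfwithmax2 = df.groupBy("department").agg(
  max("salary") as "salary",
  first("employee_name") as "employee_name",
  first("department") as "department_first", // distinct alias, so later references to "department" stay unambiguous
  first("state") as "state",
  first("age") as "age",
  first("bonus") as "bonus")
dfwithmax2.show()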
Thanks! Please upvote if it helps.
Answer 3 (score: 0)
This solution may help.
from pyspark.sql import SQLContext
from pyspark import SparkContext, SparkConf
from pyspark.sql import functions as F
from pyspark.sql import Window

# Create the sqlContext used below (already available in a pyspark shell).
sc = SparkContext.getOrCreate(SparkConf())
sqlContext = SQLContext(sc)
name_list = [(101, 'abc', 24), (102, 'cde', 24), (103, 'efg', 22), (104, 'ghi', 21),
(105, 'ijk', 20), (106, 'klm', 19), (107, 'mno', 18), (108, 'pqr', 18),
(109, 'rst', 26), (110, 'tuv', 27), (111, 'pqr', 18), (112, 'rst', 28), (113, 'tuv', 29)]
age_w = Window.partitionBy("age")
name_age_df = sqlContext.createDataFrame(name_list, ['id', 'name', 'age'])
name_age_count_df = name_age_df.withColumn("count", F.count("id").over(age_w)).orderBy("count")
name_age_count_df.show()
+---+----+---+-----+
| id|name|age|count|
+---+----+---+-----+
|109| rst| 26| 1|
|113| tuv| 29| 1|
|110| tuv| 27| 1|
|106| klm| 19| 1|
|103| efg| 22| 1|
|104| ghi| 21| 1|
|105| ijk| 20| 1|
|112| rst| 28| 1|
|101| abc| 24| 2|
|102| cde| 24| 2|
|107| mno| 18| 3|
|111| pqr| 18| 3|
|108| pqr| 18| 3|
+---+----+---+-----+
Answer 4 (score: 0)
Remember that aggregate functions reduce rows, so you need reduction logic that specifies which row each retained value should come from. If you want to keep all rows of a group (warning! this can cause explosions or skewed partitions), you can collect them as a list. You can then use a UDF (user defined function) to reduce them by your own criterion, money in my example, and expand the columns of the single reduced row with another UDF. For the purposes of this answer I assume you want to keep the name of the person with the most money.
import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
import scala.collection.mutable
object TestJob3 {

  def main(args: Array[String]): Unit = {

    val sparkSession = SparkSession
      .builder()
      .appName(this.getClass.getName.replace("$", ""))
      .master("local")
      .getOrCreate()

    val sc = sparkSession.sparkContext

    import sparkSession.sqlContext.implicits._

    val rawDf = Seq(
      (1, "Moe", "Slap", 2.0, 18),
      (2, "Larry", "Spank", 3.0, 15),
      (3, "Curly", "Twist", 5.0, 15),
      (4, "Laurel", "Whimper", 3.0, 9),
      (5, "Hardy", "Laugh", 6.0, 18),
      (6, "Charley", "Ignore", 5.0, 5)
    ).toDF("id", "name", "requisite", "money", "age")

    rawDf.show(false)
    rawDf.printSchema

    val rawSchema = rawDf.schema

    val fUdf = udf(reduceByMoney, rawSchema)
    val nameUdf = udf(extractName, StringType)

    val aggDf = rawDf
      .groupBy("age")
      .agg(
        count(struct("*")).as("count"),
        max(col("money")),
        collect_list(struct("*")).as("horizontal")
      )
      .withColumn("short", fUdf($"horizontal"))
      .withColumn("name", nameUdf($"short"))
      .drop("horizontal")

    aggDf.printSchema
    aggDf.show(false)
  }

  def reduceByMoney = (x: Any) => {
    val d = x.asInstanceOf[mutable.WrappedArray[GenericRowWithSchema]]
    val red = d.reduce((r1, r2) => {
      val money1 = r1.getAs[Double]("money")
      val money2 = r2.getAs[Double]("money")
      val r3 = money1 match {
        case a if a >= money2 =>
          r1
        case _ =>
          r2
      }
      r3
    })
    red
  }

  def extractName = (x: Any) => {
    val d = x.asInstanceOf[GenericRowWithSchema]
    d.getAs[String]("name")
  }
}
Here is the output:
+---+-----+----------+----------------------------+-------+
|age|count|max(money)|short |name |
+---+-----+----------+----------------------------+-------+
|5 |1 |5.0 |[6, Charley, Ignore, 5.0, 5]|Charley|
|15 |2 |5.0 |[3, Curly, Twist, 5.0, 15] |Curly |
|9 |1 |3.0 |[4, Laurel, Whimper, 3.0, 9]|Laurel |
|18 |2 |6.0 |[5, Hardy, Laugh, 6.0, 18] |Hardy |
+---+-----+----------+----------------------------+-------+
Answer 5 (score: 0)
An aggregate function reduces the row values of the specified columns within each group. If you wish to retain other row values, you need to implement reduction logic that specifies which row each value comes from, for instance keeping every value from the row that carries the group's maximum. To this end you can use a UDAF (user defined aggregate function) to reduce the rows within a group.
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
object AggregateKeepingRowJob {

  def main(args: Array[String]): Unit = {

    val sparkSession = SparkSession
      .builder()
      .appName(this.getClass.getName.replace("$", ""))
      .master("local")
      .getOrCreate()

    val sc = sparkSession.sparkContext
    sc.setLogLevel("ERROR")

    import sparkSession.sqlContext.implicits._

    val rawDf = Seq(
      (1L, "Moe", "Slap", 2.0, 18),
      (2L, "Larry", "Spank", 3.0, 15),
      (3L, "Curly", "Twist", 5.0, 15),
      (4L, "Laurel", "Whimper", 3.0, 15),
      (5L, "Hardy", "Laugh", 6.0, 15),
      (6L, "Charley", "Ignore", 5.0, 5)
    ).toDF("id", "name", "requisite", "money", "age")

    rawDf.show(false)
    rawDf.printSchema

    val maxAgeUdaf = new KeepRowWithMaxAge

    val aggDf = rawDf
      .groupBy("age")
      .agg(
        count("id"),
        max(col("money")),
        maxAgeUdaf(
          col("id"),
          col("name"),
          col("requisite"),
          col("money"),
          col("age")).as("KeepRowWithMaxAge")
      )

    aggDf.printSchema
    aggDf.show(false)
  }
}
UDAF:
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
// Keeps, within each group, the entire row that has the highest "money" value.
class KeepRowWithMaxAge extends UserDefinedAggregateFunction {

  // The input fields of the aggregate function: the full row we want to keep.
  override def inputSchema: org.apache.spark.sql.types.StructType =
    StructType(
      StructField("id", LongType) ::
      StructField("name", StringType) ::
      StructField("requisite", StringType) ::
      StructField("money", DoubleType) ::
      StructField("age", IntegerType) :: Nil
    )

  // The internal fields kept while computing the aggregate: the current best row.
  override def bufferSchema: StructType = StructType(
    StructField("id", LongType) ::
    StructField("name", StringType) ::
    StructField("requisite", StringType) ::
    StructField("money", DoubleType) ::
    StructField("age", IntegerType) :: Nil
  )

  // The output type of the aggregation function: the retained row as a struct.
  override def dataType: DataType =
    StructType(Array(
      StructField("id", LongType),
      StructField("name", StringType),
      StructField("requisite", StringType),
      StructField("money", DoubleType),
      StructField("age", IntegerType)
    ))

  override def deterministic: Boolean = true

  // The initial value of the buffer.
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = ""
    buffer(2) = ""
    buffer(3) = 0.0
    buffer(4) = 0
  }

  // How to update the buffer given an input row: keep the incoming row
  // if its money is higher than the buffered one.
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val money = buffer.getAs[Double](3)
    val candidateMoney = input.getAs[Double](3)
    if (money < candidateMoney) {
      buffer(0) = input.getAs[Long](0)
      buffer(1) = input.getAs[String](1)
      buffer(2) = input.getAs[String](2)
      buffer(3) = input.getAs[Double](3)
      buffer(4) = input.getAs[Int](4)
    }
  }

  // How to merge two buffers: keep whichever partial row has the higher money value.
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    if (buffer1.getAs[Double](3) < buffer2.getAs[Double](3)) {
      buffer1(0) = buffer2.getAs[Long](0)
      buffer1(1) = buffer2.getAs[String](1)
      buffer1(2) = buffer2.getAs[String](2)
      buffer1(3) = buffer2.getAs[Double](3)
      buffer1(4) = buffer2.getAs[Int](4)
    }
  }

  // Output the final value: the buffered row.
  override def evaluate(buffer: Row): Any = {
    buffer
  }
}
Answer 6 (score: 0)
Here is an example I came across in a spark-workshop:
import org.apache.spark.sql.functions._
import spark.implicits._

val populationDF = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .format("csv").load("file:///databricks/driver/population.csv")
  .select('name, regexp_replace(col("population"), "\\s", "").cast("integer").as("population"))
val maxPopulationDF = populationDF.agg(max('population).as("populationmax"))
To get the other columns, I do a simple join between the original DF and the aggregated one:
populationDF.join(maxPopulationDF,populationDF.col("population") === maxPopulationDF.col("populationmax")).select('name, 'populationmax).show()
Answer 7 (score: 0)
This pyspark code selects, for each group, the B value of the max([A, B]) combination (if the maximum occurs more than once in a group, one of the tied rows is picked at random). In your case A would be age, and B any column you did not group by but still want to select.
from pyspark.sql import functions as F

df = spark.createDataFrame([
[1, 1, 0.2],
[1, 1, 0.9],
[1, 2, 0.6],
[1, 2, 0.5],
[1, 2, 0.6],
[2, 1, 0.2],
[2, 2, 0.1],
], ["group", "A", "B"])
out = (
df
.withColumn("AB", F.struct("A", "B"))
.groupby("group")
# F.max(AB) selects AB-combinations with max `A`. If more
# than one combination remains the one with max `B` is selected. If
# after this identical combinations remain, a single one of them is picked
# randomly.
.agg(F.max("AB").alias("max_AB"))
.select("group", F.expr("max_AB.B"))
)
out.show()
Output:
+-----+---+
|group| B|
+-----+---+
| 1|0.6|
| 2|0.1|
+-----+---+
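Applied to the original question, the same trick could look roughly like this in Scala (a sketch, assuming df has the columns name, age and id; here the carried name comes from the row with the largest id, and the id_count alias is just illustrative):
import org.apache.spark.sql.functions._

// Pack the columns to carry along into a struct, take the max per age group, then unpack.
// The struct is compared field by field, so the row with the largest id wins.
val picked = df
  .groupBy("age")
  .agg(max(struct("id", "name")).as("max_row"), count("id").as("id_count"))
  .select(col("age"), col("max_row.name").as("name"), col("id_count"))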
Answer 8 (score: -6)
You can do it like this:
Sample data:
name age id
abc 24 1001
cde 24 1002
efg 22 1003
ghi 21 1004
ijk 20 1005
klm 19 1006
mno 18 1007
pqr 18 1008
rst 26 1009
tuv 27 1010
pqr 18 1012
rst 28 1013
tuv 29 1011
df.select("name","age","id").groupBy("name","age").count().show();
Output:
+----+---+-----+
|name|age|count|
+----+---+-----+
| efg| 22| 1|
| tuv| 29| 1|
| rst| 28| 1|
| klm| 19| 1|
| pqr| 18| 2|
| cde| 24| 1|
| tuv| 27| 1|
| ijk| 20| 1|
| abc| 24| 1|
| mno| 18| 1|
| ghi| 21| 1|
| rst| 26| 1|
+----+---+-----+