Using groupBy in Spark and getting back to a DataFrame

时间:2015-11-12 11:15:26

标签: scala apache-spark apache-spark-sql

I have a difficulty when working with data frames in spark with Scala. If I have a data frame that I want to extract a column of unique entries, when I use groupBy I don't get a data frame back.

For example, I have a DataFrame called logs that has the following form:

machine_id  | event     | other_stuff
 34131231   | thing     |   stuff
 83423984   | notathing | notstuff
 34131231   | thing    | morestuff

and I would like the unique machine ids where event is thing stored in a new DataFrame to allow me to do some filtering of some kind. Using

val machineId = logs
  .where($"event" === "thing")
  .select("machine_id")
  .groupBy("machine_id")

I get a val of Grouped Data back which is a pain in the butt to use (or I don't know how to use this kind of object properly). Having got this list of unique machine id's, I then want to use this in filtering another DataFrame to extract all events for individual machine ids.

I can see I'll want to do this kind of thing fairly regularly and the basic workflow is:

  1. Extract unique id's from a log table.
  2. Use unique ids to extract all events for a particular id.
  3. Use some kind of analysis on this data that has been extracted.

It's the first two steps I would appreciate some guidance with here.

I appreciate this example is kind of contrived but hopefully it explains what my issue is. It may be I don't know enough about GroupedData objects or (as I'm hoping) I'm missing something in data frames that makes this easy. I'm using spark 1.5 built on Scala 2.10.4.

Thanks

2 个答案:

答案 0 :(得分:7)

Just use distinct not groupBy:

val machineId = logs.where($"event"==="thing").select("machine_id").distinct

Which will be equivalent to SQL:

SELECT DISTINCT machine_id FROM logs WHERE event = 'thing'

GroupedData is not intended to be used directly. It provides a number of methods, where agg is the most general, which can be used to apply different aggregate functions and convert it back to DataFrame. In terms of SQL what you have after where and groupBy is equivalent to something like this

SELECT machine_id, ... FROM logs WHERE event = 'thing' GROUP BY machine_id

where ... has to be provided by agg or equivalent method.

答案 1 :(得分:1)

在spark中跟随聚合的一个组,然后一个select语句将返回一个数据框。对于你的例子,它应该是这样的:

val machineId = logs
    .groupBy("machine_id", "event")
    .agg(max("other_stuff") )
    .select($"machine_id").where($"event" === "thing")