I have a difficulty when working with data frames in spark with Scala. If I have a data frame that I want to extract a column of unique entries, when I use groupBy
I don't get a data frame back.
For example, I have a DataFrame
called logs that has the following form:
machine_id | event | other_stuff
34131231 | thing | stuff
83423984 | notathing | notstuff
34131231 | thing | morestuff
and I would like the unique machine ids where event is thing stored in a new DataFrame
to allow me to do some filtering of some kind. Using
val machineId = logs
.where($"event" === "thing")
.select("machine_id")
.groupBy("machine_id")
I get a val of Grouped Data back which is a pain in the butt to use (or I don't know how to use this kind of object properly). Having got this list of unique machine id's, I then want to use this in filtering another DataFrame
to extract all events for individual machine ids.
I can see I'll want to do this kind of thing fairly regularly and the basic workflow is:
It's the first two steps I would appreciate some guidance with here.
I appreciate this example is kind of contrived but hopefully it explains what my issue is. It may be I don't know enough about GroupedData
objects or (as I'm hoping) I'm missing something in data frames that makes this easy. I'm using spark 1.5 built on Scala 2.10.4.
Thanks
答案 0 :(得分:7)
Just use distinct
not groupBy
:
val machineId = logs.where($"event"==="thing").select("machine_id").distinct
Which will be equivalent to SQL:
SELECT DISTINCT machine_id FROM logs WHERE event = 'thing'
GroupedData
is not intended to be used directly. It provides a number of methods, where agg
is the most general, which can be used to apply different aggregate functions and convert it back to DataFrame
. In terms of SQL what you have after where
and groupBy
is equivalent to something like this
SELECT machine_id, ... FROM logs WHERE event = 'thing' GROUP BY machine_id
where ...
has to be provided by agg
or equivalent method.
答案 1 :(得分:1)
在spark中跟随聚合的一个组,然后一个select语句将返回一个数据框。对于你的例子,它应该是这样的:
val machineId = logs
.groupBy("machine_id", "event")
.agg(max("other_stuff") )
.select($"machine_id").where($"event" === "thing")