Apache Spark in Java

Date: 2017-07-25 06:04:02

Tags: java

DataSet

Referring to the DataSet above, I want to count the number of distinct entries in its fourth column. I have working code for this in Python, but I have not been able to implement the same thing in Java with Spark.

Python code:

user_data = sc.textFile("<dataSet path>")  # placeholder: path to the dataset

# split each line into fields (assuming "|"-delimited, as in the MovieLens u.user file)
user_fields = user_data.map(lambda line: line.split("|"))

# count the number of distinct occupations (fourth column)
num_occupations = user_fields.map(lambda fields: fields[3]).distinct().count()
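A direct Java translation of that RDD pipeline could look like the following sketch (the "|" delimiter, the class name, and the placeholder path are assumptions, not from the original post):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DistinctOccupations {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("DistinctOccupations"));

        JavaRDD<String> userData = sc.textFile("<dataSet path>");  // placeholder path

        // String.split takes a regex in Java, so "|" must be escaped.
        JavaRDD<String[]> userFields = userData.map(line -> line.split("\\|"));

        // Number of distinct entries in the fourth column.
        long numOccupations = userFields.map(fields -> fields[3]).distinct().count();
        System.out.println("numOccupations = " + numOccupations);

        sc.stop();
    }
}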

2 Answers

Answer 0 (score: 0)

You can use groupBy and count:

num_occupations = user_fields.groupBy("name_of_your_column").count()
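A minimal Java sketch of that suggestion (the column names, the "|" delimiter, and the placeholder path are assumptions based on the sample rows in the other answer):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OccupationCounts {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("OccupationCounts")
                .getOrCreate();

        // Load the "|"-delimited user file into a DataFrame with named columns.
        Dataset<Row> users = spark.read()
                .option("sep", "|")
                .csv("<dataSet path>")  // placeholder path
                .toDF("id", "age", "gender", "occupation", "zip");

        // One row per occupation with its count; the number of rows in this
        // result equals the number of distinct occupations.
        Dataset<Row> counts = users.groupBy("occupation").count();
        counts.show();
        System.out.println("Distinct occupations: " + counts.count());

        spark.stop();
    }
}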

References:

https://spark.apache.org/docs/2.1.1/api/java/index.html#package

https://spark.apache.org/docs/2.1.1/api/java/org/apache/spark/sql/RelationalGroupedDataset.html

Answer 1 (score: 0)

import org.apache.spark.sql.functions.collect_set

// Sample user rows: (id, age, gender, occupation, zip)
val df = sc.parallelize(
  Seq(
    (892, 36, "M", "other",     "45243"),
    (893, 25, "M", "student",   "95823"),
    (894, 47, "M", "education", "74075"),
    (895, 31, "F", "librarian", "74075"),
    (896, 28, "M", "writer",    "91505"),
    (897, 30, "M", "hommaker",  "61755")
  )
).toDF("a", "b", "c", "d", "e")

// Group by the fourth column "d" and collect the ids seen for each value.
df.groupBy("d").agg(collect_set("a")).show()

Result:

scala> val df = sc.parallelize(
     | Seq(
     |   (892,36,"M","other","45243"),
     |   (893,25,"M","student","95823"),
     |   (894,47,"M","education","74075"),
     |   (895,31,"F","librarian","74075"),
     |   (896,28,"M","writer","91505"),
     |   (897,30,"M","hommaker","61755")
     |   )
     |   ).toDF("a","b","c","d","e");
df: org.apache.spark.sql.DataFrame = [a: int, b: int ... 3 more fields]

scala> val df2 = df.groupBy("d").agg(collect_set("a")).show()
+---------+--------------+
|        d|collect_set(a)|
+---------+--------------+
|librarian|         [895]|
| hommaker|         [897]|
|education|         [894]|
|   writer|         [896]|
|    other|         [892]|
|  student|         [893]|
+---------+--------------+
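Since the question asks for Java, the same aggregation can be written as follows (a sketch; the class name and the explicit schema construction are additions for illustration, using the sample rows from the Scala session above):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.collect_set;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class CollectSetExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("CollectSetExample")
                .getOrCreate();

        // Same sample rows as the Scala session above.
        List<Row> rows = Arrays.asList(
                RowFactory.create(892, 36, "M", "other", "45243"),
                RowFactory.create(893, 25, "M", "student", "95823"),
                RowFactory.create(894, 47, "M", "education", "74075"),
                RowFactory.create(895, 31, "F", "librarian", "74075"),
                RowFactory.create(896, 28, "M", "writer", "91505"),
                RowFactory.create(897, 30, "M", "hommaker", "61755"));

        StructType schema = new StructType()
                .add("a", DataTypes.IntegerType)
                .add("b", DataTypes.IntegerType)
                .add("c", DataTypes.StringType)
                .add("d", DataTypes.StringType)
                .add("e", DataTypes.StringType);

        Dataset<Row> df = spark.createDataFrame(rows, schema);

        // Group by the fourth column and collect the ids seen for each value.
        df.groupBy("d").agg(collect_set(col("a"))).show();

        spark.stop();
    }
}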