I have a dataset.
Referring to this dataset, I want to count the number of distinct entries in the fourth column. I have code that does this in Python, but I cannot implement the same thing in Java with Spark.
Python code:
# dataSetPath is a placeholder for the actual path to the dataset
user_data = sc.textFile(dataSetPath)
# split each line into its fields (delimiter assumed to be a comma)
user_fields = user_data.map(lambda line: line.split(","))
# count the number of distinct occupations (fourth column)
num_occupations = user_fields.map(lambda fields: fields[3]).distinct().count()
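For reference, a direct Java translation of this RDD pipeline could look like the sketch below; the JavaSparkContext named sc, the dataSetPath variable, and the comma delimiter are assumptions for illustration, not part of the original question.

import org.apache.spark.api.java.JavaRDD;

// assuming an existing JavaSparkContext named sc and a comma-separated file
JavaRDD<String> userData = sc.textFile(dataSetPath);
// split each line into its fields
JavaRDD<String[]> userFields = userData.map(line -> line.split(","));
// map to the fourth field, then count the distinct values
long numOccupations = userFields.map(fields -> fields[3]).distinct().count();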
Answer 0 (score: 0)
If you load the data into a DataFrame, you can use groupBy and count:

occupation_counts = df.groupBy("name_of_your_column").count()

Note that count() here returns a new DataFrame with one row per distinct value and its frequency; the number of rows in that result is the number of distinct entries.
References:
https://spark.apache.org/docs/2.1.1/api/java/index.html#package
https://spark.apache.org/docs/2.1.1/api/java/org/apache/spark/sql/RelationalGroupedDataset.html
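A minimal, self-contained Java sketch of this approach (the CSV loading, the column names a through e, and the local master are illustrative assumptions):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OccupationCount {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("OccupationCount")
                .master("local[*]") // assumption: local run for testing
                .getOrCreate();

        // assumption: comma-separated file, path passed as the first argument
        Dataset<Row> df = spark.read().csv(args[0])
                .toDF("a", "b", "c", "d", "e"); // hypothetical column names

        // one row per distinct value of the fourth column, with its frequency
        df.groupBy("d").count().show();

        // the number of distinct entries in the fourth column
        long numOccupations = df.select("d").distinct().count();
        System.out.println("distinct values in column d: " + numOccupations);

        spark.stop();
    }
}

groupBy("d").count() answers "how often does each value occur", while select("d").distinct().count() answers the original question directly.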
Answer 1 (score: 0)
import org.apache.spark.sql.functions.collect_set

val df = sc.parallelize(
  Seq(
    (892, 36, "M", "other", "45243"),
    (893, 25, "M", "student", "95823"),
    (894, 47, "M", "education", "74075"),
    (895, 31, "F", "librarian", "74075"),
    (896, 28, "M", "writer", "91505"),
    (897, 30, "M", "hommaker", "61755")
  )
).toDF("a", "b", "c", "d", "e")

// one row per distinct value of column "d", with the set of ids carrying it
df.groupBy("d").agg(collect_set("a")).show()
Result:
scala> val df = sc.parallelize(
| Seq(
| (892,36,"M","other","45243"),
| (893,25,"M","student","95823"),
| (894,47,"M","education","74075"),
| (895,31,"F","librarian","74075"),
| (896,28,"M","writer","91505"),
| (897,30,"M","hommaker","61755")
| )
| ).toDF("a","b","c","d","e");
df: org.apache.spark.sql.DataFrame = [a: int, b: int ... 3 more fields]
scala> val df2 = df.groupBy("d").agg(collect_set("a")).show()
+---------+--------------+
| d|collect_set(a)|
+---------+--------------+
|librarian| [895]|
| hommaker| [897]|
|education| [894]|
| writer| [896]|
| other| [892]|
| student| [893]|
+---------+--------------+
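The number of rows in this result (6 here) is the number of distinct entries in column d. If only that count is needed, df.select("d").distinct().count() returns it directly.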