I defined a custom class, Person, for my data and applied the groupByKey operation as shown below:
public class Person implements Serializable {
    private static final long serialVersionUID = 1L;
    private int personId;
    private String name;
    private String address;
    public Person(int personId, String name, String address) {
        this.personId = personId;
        this.name = name;
        this.address = address;
    }
    public int getPersonId() { return personId; }
    public void setPersonId(int personId) { this.personId = personId; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getAddress() { return address; }
    public void setAddress(String address) { this.address = address; }
}
List<Person> personList = new ArrayList<Person>();
personList.add(new Person(111, "abc", "test1"));
personList.add(new Person(222, "def", "test2"));
personList.add(new Person(333, "fhg", "test3"));
personList.add(new Person(111, "jkl", "test4"));
personList.add(new Person(555, "mno", "test5"));
personList.add(new Person(444, "pqr", "test6"));
personList.add(new Person(111, "xyz", "test7"));
JavaRDD<Person> initialRDD = sc.parallelize(personList, 4);
JavaPairRDD<Person, Iterable<Person>> groupedBy =
initialRDD.cartesian(initialRDD).groupByKey();
However, when I print the keys with the following line, the result is not grouped by key at all.
groupedBy.foreach(x -> System.out.println(x._1.getPersonId()));
The result is: 222 111 555 444 555 111 222 111 333 222 444 111 111 111 444 111 333 111 111 222 555 111 333 333 444 111 111 555
I expected the result to contain only the unique keys. Is my understanding of the groupByKey function in Spark wrong?
Answer 0 (score: 1)
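groupByKey, like the other byKey operations, depends on a meaningful implementation of hashCode and equals. Since you did not provide your own, Person falls back to the default implementations inherited from Object. Those compare object identity, so two Person instances with the same personId are treated as different keys, which is useless here.
Try, for example: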
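@Override
public int hashCode() {
    return Integer.hashCode(personId);
}
@Override
public boolean equals(Object o) {
    if (this == o) return true;
    // Rejects null and objects of other types that merely share a hash code.
    if (!(o instanceof Person)) return false;
    // Two Person objects are the same key when their IDs match.
    return this.personId == ((Person) o).personId;
}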
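The effect can be reproduced without Spark. The following is a minimal sketch in plain Java that uses a HashMap as a stand-in for key-based grouping; it is only an analogy for what groupByKey does with keys, not Spark's actual implementation. The names GroupingDemo, IdentityKey, and ValueKey are hypothetical and exist only for this illustration. The IDs are the ones from the question.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupingDemo {
    // Hypothetical key type WITHOUT equals/hashCode overrides:
    // it inherits identity-based behavior from Object.
    static final class IdentityKey {
        final int id;
        IdentityKey(int id) { this.id = id; }
    }

    // Hypothetical key type WITH value-based equals/hashCode on id.
    static final class ValueKey {
        final int id;
        ValueKey(int id) { this.id = id; }
        @Override public int hashCode() { return Integer.hashCode(id); }
        @Override public boolean equals(Object o) {
            return o instanceof ValueKey && ((ValueKey) o).id == id;
        }
    }

    // Groups the ids under freshly created IdentityKey objects
    // and returns the number of distinct groups.
    public static int countGroupsByIdentity(int[] ids) {
        Map<IdentityKey, List<Integer>> groups = new HashMap<>();
        for (int id : ids) {
            groups.computeIfAbsent(new IdentityKey(id), k -> new ArrayList<>()).add(id);
        }
        return groups.size();
    }

    // Same grouping, but with keys that define value equality.
    public static int countGroupsByValue(int[] ids) {
        Map<ValueKey, List<Integer>> groups = new HashMap<>();
        for (int id : ids) {
            groups.computeIfAbsent(new ValueKey(id), k -> new ArrayList<>()).add(id);
        }
        return groups.size();
    }

    public static void main(String[] args) {
        int[] ids = {111, 222, 333, 111, 555, 444, 111};
        System.out.println(countGroupsByIdentity(ids)); // prints 7: every key object is its own group
        System.out.println(countGroupsByValue(ids));    // prints 5: one group per distinct id
    }
}
```

With identity-based keys every object forms its own group, which matches the repeated IDs seen in the question's output. Once equals and hashCode are defined on the ID, equal IDs collapse into a single group.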