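Spark - Sorting elements by multiple fields using Java

Time: 2015-06-25 04:05:09

Tags: java apache-spark

I have a JavaRDD containing Person details, and I want to sort its elements first by the Age field and then by the Name field.

Sample input: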

Age, Name, Country
33,Jack,USA
24,Sam,USA
31,Jack,USA

My output should look like this:

Age, Name, Country
24,Sam,USA
31,Jack,USA
33,Jack,USA

How can I achieve this using the sortBy transformation?

Regards, Shankar

2 answers:

Answer 0: (score: 2)

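It is rather ugly in Java (Scala's case classes are very handy here), but you can do it by creating a bean for the record and implementing Comparable on it. Then simply call the sortBy method with an identity key function.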

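The bean approach described in this answer could look roughly like the sketch below. The Person class, its fields, and the parse helper are hypothetical names chosen for illustration, and the sort is exercised on a local list so that the example runs without a cluster.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical record bean. Serializable so that Spark could ship it between executors.
final class Person implements Comparable<Person>, Serializable {
    private static final long serialVersionUID = 1L;

    private final int age;
    private final String name;
    private final String country;

    Person(int age, String name, String country) {
        this.age = age;
        this.name = name;
        this.country = country;
    }

    int getAge() { return age; }
    String getName() { return name; }
    String getCountry() { return country; }

    // Parses one "age,name,country" line such as "33,Jack,USA".
    static Person parse(String line) {
        String[] parts = line.split(",");
        return new Person(Integer.parseInt(parts[0].trim()), parts[1].trim(), parts[2].trim());
    }

    // Natural ordering: Age first, then Name to break ties.
    @Override
    public int compareTo(Person other) {
        int byAge = Integer.compare(age, other.age);
        return byAge != 0 ? byAge : name.compareTo(other.name);
    }

    @Override
    public String toString() {
        return age + "," + name + "," + country;
    }
}

class PersonSortDemo {
    // Local stand-in for the Spark sort: relies on the same natural ordering.
    static List<Person> sortPeople(List<Person> people) {
        List<Person> copy = new ArrayList<Person>(people);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
                Person.parse("33,Jack,USA"),
                Person.parse("24,Sam,USA"),
                Person.parse("31,Jack,USA"));
        for (Person p : sortPeople(people)) {
            System.out.println(p);
        }
    }
}
```

With a JavaRDD<Person> (here called personRDD, an assumed name), the equivalent Spark call would be along the lines of personRDD.sortBy(p -> p, true, numPartitions) with a partition count of your choosing, since JavaRDD.sortBy orders keys by their natural (Comparable) ordering.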

Answer 1: (score: 1)

The following code does what you need:

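import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Assumes an existing JavaSparkContext `sc` and SQLContext `sqlContext` (Spark 1.x API).
JavaRDD<String> people = sc.textFile("/home/hduser/input");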

// Build the schema explicitly. Age is an integer column so that it sorts
// numerically rather than as text.
List<StructField> fields = new ArrayList<StructField>();
fields.add(DataTypes.createStructField("Age", DataTypes.IntegerType, true));
fields.add(DataTypes.createStructField("Name", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("Country", DataTypes.StringType, true));
StructType schema = DataTypes.createStructType(fields);

// Convert records of the RDD (people) to Rows.
// Assumes the file contains only data lines (no "Age, Name, Country" header).
JavaRDD<Row> rowRDD = people.map(new Function<String, Row>() {
    public Row call(String record) throws Exception {
        String[] parts = record.split(",");
        return RowFactory.create(Integer.parseInt(parts[0].trim()),
                parts[1].trim(), parts[2].trim());
    }
});

// Apply the schema to the RDD.
DataFrame peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema);

// Register the DataFrame as a table.
peopleDataFrame.registerTempTable("people");

// Query the registered table, sorting by Age first and then by Name.
DataFrame results = sqlContext.sql("SELECT * FROM people").sort("Age", "Name");

results.show();
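One detail worth checking with a DataFrame approach like this is the type of the Age column: a string column sorts lexicographically, which only matches numeric order when all ages have the same number of digits. The plain-Java sketch below illustrates the difference. The class and method names are hypothetical, and it uses local lists rather than Spark.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class AgeOrderingDemo {
    // Sorts the ages as text, the way a string-typed column would be ordered.
    static List<String> sortAsText(List<String> ages) {
        List<String> copy = new ArrayList<String>(ages);
        Collections.sort(copy);
        return copy;
    }

    // Sorts the ages by their numeric value, the way an integer column would be ordered.
    static List<String> sortAsNumbers(List<String> ages) {
        List<String> copy = new ArrayList<String>(ages);
        Collections.sort(copy, new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                return Integer.compare(Integer.parseInt(a), Integer.parseInt(b));
            }
        });
        return copy;
    }

    public static void main(String[] args) {
        List<String> ages = Arrays.asList("9", "24", "100");
        System.out.println(sortAsText(ages));    // prints [100, 24, 9]
        System.out.println(sortAsNumbers(ages)); // prints [9, 24, 100]
    }
}
```

For the sample data in the question every age has two digits, so both orderings agree; the difference only shows up with ages such as 9 or 100.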