I am trying to assign a unique ID to the rows of a dataset based on some column value. For example, assume we have a dataset as follows:
State Country Person
MH IN ABC
AP IN XYZ
J&K IN XYZ
MH IN PQR
Now I want to assign a unique ID based on the State column value; whenever a value repeats, the same ID should be assigned. The output should look like this:
State Country Person Unique_ID
MH IN ABC 1
AP IN XYZ 2
J&K IN XYZ 3
MH IN PQR 1
How can I solve this using Spark with Java? Any help would be appreciated.
Answer 0 (score: 0)
Here is one way to do it using Spark with Java.
package com.stackoverflow.works;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

import static org.apache.spark.sql.functions.dense_rank;
import static org.apache.spark.sql.functions.desc;

public class UniqueIdJob {

    @SuppressWarnings("serial")
    public static class Record implements Serializable {
        private String state;
        private String country;
        private String person;

        // A public no-arg constructor is required by Encoders.bean().
        public Record() {
        }

        public Record(String state, String country, String person) {
            this.state = state;
            this.country = country;
            this.person = person;
        }

        public String getState() {
            return state;
        }

        public void setState(String state) {
            this.state = state;
        }

        public String getCountry() {
            return country;
        }

        public void setCountry(String country) {
            this.country = country;
        }

        public String getPerson() {
            return person;
        }

        public void setPerson(String person) {
            this.person = person;
        }
    }

    private static Dataset<Record> createDataset(SparkSession spark) {
        List<Record> records = new ArrayList<>();
        records.add(new Record("MH", "IN", "ABC"));
        records.add(new Record("AP", "IN", "XYZ"));
        records.add(new Record("J&K", "IN", "XYZ"));
        records.add(new Record("MH", "IN", "PQR"));
        records.add(new Record("AP", "IN", "XYZ1"));
        records.add(new Record("AP", "IN", "XYZ2"));
        Encoder<Record> recordEncoder = Encoders.bean(Record.class);
        return spark.createDataset(records, recordEncoder);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("UniqueIdJob")
                .master("local[2]")
                .getOrCreate();

        Dataset<Record> recordDataset = createDataset(spark);

        // dense_rank() over a window ordered by state assigns the same rank
        // to every row that shares the same state value.
        WindowSpec windowSpec = Window.orderBy(desc("state"));
        Dataset<Row> rowDataset = recordDataset.withColumn("id", dense_rank().over(windowSpec));
        rowDataset.show();

        spark.stop();
    }
}
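If the single un-partitioned window ever becomes a bottleneck, one variation (a sketch, untested, reusing spark, recordDataset, and the imports from the listing above) is to rank only the distinct states and join that small mapping back onto the full dataset:

// Rank the distinct states only (a handful of rows), then join back.
Dataset<Row> stateIds = recordDataset
        .select("state")
        .distinct()
        .withColumn("unique_id", dense_rank().over(Window.orderBy("state")));

// Every row with the same state picks up the same unique_id.
Dataset<Row> withIds = recordDataset.join(stateIds, "state");
withIds.show();

The window still has no partition clause, but it now only covers the distinct states, so the single-partition shuffle stays tiny.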
Answer 1 (score: -1)
It is a bit slow, but you can do something like the following:
select state, country, person, dense_rank() over (order by state) as unique_id from ds;
This should do the job. Note, however, that a window function without a partition clause moves all rows into a single partition, which is slow on large data.
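To run the same statement from Java (a sketch, assuming the Dataset from answer 0 has been registered as a temp view named ds):

// Register the dataset as a temp view, then run the dense_rank query.
recordDataset.createOrReplaceTempView("ds");
Dataset<Row> ranked = spark.sql(
        "select state, country, person, dense_rank() over (order by state) as unique_id from ds");
ranked.show();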
Answer 2 (score: -1)
You can define your own UDF (user-defined function) and put whatever logic you like inside it to create the unique ID.
In the example below, I created a UDF that derives the ID from the hash code of the state. Keep in mind that String.hashCode can collide, so the IDs are not strictly guaranteed to be unique.
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_172)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val dataDF=Seq(("MH", "IN","ABC"),("AP", "IN","XYZ"),("J&K","IN","XYZ"),("MH", "IN","PQR")).toDF("State","Country","Person")
dataDF: org.apache.spark.sql.DataFrame = [State: string, Country: string ... 1 more field]
scala> dataDF.createOrReplaceTempView("table1")
scala> def uniqueId(col:String)={col.hashCode}
uniqueId: (col: String)Int
scala> spark.udf.register("uniqueid",uniqueId _)
res1: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(StringType)))
scala> spark.sql("select state,country,person ,uniqueid(state) as unique_id from table1").show
+-----+-------+------+---------+
|state|country|person|unique_id|
+-----+-------+------+---------+
| MH| IN| ABC| 2459|
| AP| IN| XYZ| 2095|
| J&K| IN| XYZ| 72367|
| MH| IN| PQR| 2459|
+-----+-------+------+---------+
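For completeness, a Java equivalent of this UDF (a sketch, assuming the same table1 temp view has been registered from Java; note again that hash codes can collide):

import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// Register a UDF that maps a state string to its (not collision-free) hash code.
spark.udf().register("uniqueid",
        (UDF1<String, Integer>) String::hashCode,
        DataTypes.IntegerType);
spark.sql("select state, country, person, uniqueid(state) as unique_id from table1").show();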