I have a JavaPairRDD containing the following pairs:
(key0, (a,d))
(key1, (c))
(key2, (b,d,e))
(key3, (a,c,d))
Now, I would like to accomplish the following:
Combine all the values together (ignoring the keys) to obtain a "universal space": (a,b,c,d,e).
Convert each value into a vector whose entries are 1 where the value contains the corresponding element of the universal space, and 0 otherwise. For example, the first value is (a,d), so it should be converted to (1,0,0,1,0); the second value is (c), so it should be converted to (0,0,1,0,0), and so on. After the conversion, I would get the following new pair RDD:
(key0, (1,0,0,1,0)) (key1, (0,0,1,0,0)) (key2, (0,1,0,1,1)) (key3, (1,0,1,1,0))
Could someone tell me the most efficient way to achieve this in Spark (Java)? Any guidance would be greatly appreciated!
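To make the intended transformation concrete, here is a minimal plain-Java sketch of the logic (no Spark, small data only; the class and method names are illustrative): build the sorted universal space from all the value lists, then map each list to a 0/1 vector over that space.

```java
import java.util.*;

public class BinaryVectorSketch {
    // Build the sorted "universal space" from all value lists.
    static List<String> universe(Collection<List<String>> values) {
        SortedSet<String> space = new TreeSet<>();
        values.forEach(space::addAll);
        return new ArrayList<>(space);
    }

    // Convert one value list to a 0/1 vector over the universe.
    static int[] toBinaryVector(List<String> value, List<String> universe) {
        int[] vec = new int[universe.size()];  // initialized to 0
        for (String s : value) {
            vec[universe.indexOf(s)] = 1;
        }
        return vec;
    }

    public static void main(String[] args) {
        Map<String, List<String>> data = new LinkedHashMap<>();
        data.put("key0", Arrays.asList("a", "d"));
        data.put("key1", Arrays.asList("c"));
        data.put("key2", Arrays.asList("b", "d", "e"));
        data.put("key3", Arrays.asList("a", "c", "d"));

        List<String> universe = universe(data.values());  // [a, b, c, d, e]
        for (Map.Entry<String, List<String>> e : data.entrySet()) {
            System.out.println(e.getKey() + " "
                + Arrays.toString(toBinaryVector(e.getValue(), universe)));
        }
    }
}
```

This is the sequential version of the computation; in Spark the universe would come from a distinct/collect over the values, broadcast to the executors.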
Answer 0 (score: 0)
Imports needed:
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.StructType;
import scala.Tuple2;
Convert the data to a Dataset&lt;Row&gt;:
SparkSession spark = SparkSession.builder().getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

List<Tuple2<String, String[]>> data = Arrays.asList(
    new Tuple2<>("key0", new String[] {"a", "d"}),
    new Tuple2<>("key1", new String[] {"c"}),
    new Tuple2<>("key2", new String[] {"b", "d", "e"}),
    new Tuple2<>("key3", new String[] {"a", "c", "d"})
);
JavaPairRDD<String, String[]> rdd = JavaPairRDD.fromJavaRDD(jsc.parallelize(data));

StructType schema = StructType.fromDDL("key string, value array<string>");
Dataset<Row> df = spark.createDataFrame(
    rdd.map((Function<Tuple2<String, String[]>, Row>) value ->
        RowFactory.create(value._1(), value._2())),
    schema
);
and apply a CountVectorizer:
CountVectorizer vectorizer = new CountVectorizer()
    .setInputCol("value")
    .setOutputCol("vector")
    .setBinary(true);

vectorizer.fit(df).transform(df).show();
Result:
+----+---------+--------------------+
| key| value| vector|
+----+---------+--------------------+
|key0| [a, d]| (5,[0,1],[1.0,1.0])|
|key1| [c]| (5,[2],[1.0])|
|key2|[b, d, e]|(5,[0,3,4],[1.0,1...|
|key3|[a, c, d]|(5,[0,1,2],[1.0,1...|
+----+---------+--------------------+
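A note on this output: CountVectorizer orders its vocabulary by term frequency (descending), so the vector indices generally will not match the alphabetical order (a,b,c,d,e) from the question, and the result column holds sparse vectors printed as (size,[indices],[values]), e.g. (5,[0,1],[1.0,1.0]). Spark's org.apache.spark.ml.linalg.Vector exposes toArray() if you need the dense form; the expansion itself is simple, as this plain-Java sketch (no Spark dependency, names illustrative) of turning a (size, indices, values) triple into a dense array shows:

```java
import java.util.Arrays;

public class SparseToDense {
    // Expand a sparse (size, indices, values) triple -- the format printed
    // as (5,[0,1],[1.0,1.0]) above -- into a dense double array.
    static double[] toDense(int size, int[] indices, double[] values) {
        double[] dense = new double[size];  // initialized to 0.0
        for (int i = 0; i < indices.length; i++) {
            dense[indices[i]] = values[i];
        }
        return dense;
    }

    public static void main(String[] args) {
        // (5,[0,1],[1.0,1.0]) from the key0 row above.
        System.out.println(Arrays.toString(
            toDense(5, new int[]{0, 1}, new double[]{1.0, 1.0})));
        // prints [1.0, 1.0, 0.0, 0.0, 0.0]
    }
}
```

If a fixed alphabetical index order is required, as in the question, one option is to build the universe yourself and binarize manually instead of relying on CountVectorizer's frequency-based vocabulary.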