Converting a collection into a matrix: how to do it in Spark

Date: 2018-02-03 15:43:46

Tags: apache-spark rdd

I have a JavaPairRDD containing the following pairs:

(key0, (a,d))
(key1, (c))
(key2, (b,d,e))
(key3, (a,c,d))    

Now, I would like to do the following:

  1. Combine all the values together (without worrying about the keys) to obtain a "universal space": (a, b, c, d, e)

  2. Using the result of step 1, convert each value into a binary vector: a position is 1 if the value contains the corresponding element of the universal space, and 0 otherwise. For example, the first value is (a,d), so it should be converted to (1,0,0,1,0); the second value is (c), so it should become (0,0,1,0,0), and so on. After the conversion I would get the following new pair RDD:

    (key0, (1,0,0,1,0))
    (key1, (0,0,1,0,0))
    (key2, (0,1,0,1,1))
    (key3, (1,0,1,1,0))
  3. Could someone tell me the most efficient way to achieve this in Spark (Java)? Any guidance would be greatly appreciated!
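For reference, the core logic of steps 1 and 2 (build the universal space, then encode each value as a 0/1 vector over it) can be sketched in plain Java, without the Spark dependency; in an actual job these would correspond to a flatMap/distinct over the values followed by a mapValues. The class and method names here are illustrative only:

```java
import java.util.*;

public class BinaryVectors {

    // Step 1: union of all values, sorted, gives the "universal space".
    // Step 2: 1 where the value contains the element, 0 otherwise.
    static Map<String, int[]> encode(Map<String, List<String>> pairs) {
        SortedSet<String> space = new TreeSet<>();
        pairs.values().forEach(space::addAll);
        List<String> vocab = new ArrayList<>(space);   // (a, b, c, d, e)

        Map<String, int[]> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : pairs.entrySet()) {
            int[] vec = new int[vocab.size()];
            for (String v : e.getValue()) {
                vec[vocab.indexOf(v)] = 1;
            }
            result.put(e.getKey(), vec);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> pairs = new LinkedHashMap<>();
        pairs.put("key0", Arrays.asList("a", "d"));
        pairs.put("key1", Arrays.asList("c"));
        pairs.put("key2", Arrays.asList("b", "d", "e"));
        pairs.put("key3", Arrays.asList("a", "c", "d"));

        encode(pairs).forEach((k, v) -> System.out.println(k + " " + Arrays.toString(v)));
        // key0 [1, 0, 0, 1, 0]
        // key1 [0, 0, 1, 0, 0]
        // key2 [0, 1, 0, 1, 1]
        // key3 [1, 0, 1, 1, 0]
    }
}
```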

1 Answer:

Answer 0 (score: 0)

Required imports:

import java.util.Arrays;
import java.util.List;

import scala.Tuple2;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.StructType;

Convert the data to a Dataset<Row>:

SparkSession spark = SparkSession.builder().getOrCreate();

JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());


List<Tuple2<String, String[]>> data  = Arrays.asList(
        new Tuple2<>("key0", new String [] {"a", "d"}),
        new Tuple2<>("key1", new String [] {"c"}),
        new Tuple2<>("key2", new String [] {"b", "d", "e"}),
        new Tuple2<>("key3", new String [] {"a", "c", "d"})
);

JavaPairRDD<String, String[]> rdd = JavaPairRDD.fromJavaRDD(jsc.parallelize(data));

StructType schema = StructType.fromDDL("key string, value array<string>");


Dataset<Row> df = spark.createDataFrame(
        rdd.map((Function<Tuple2<String, String[]>, Row>) value -> RowFactory.create(value._1(), value._2())),
        schema
);

and apply CountVectorizer (setBinary(true) caps every count at 1.0, giving the desired 0/1 indicator vectors):

CountVectorizer vectorizer = new CountVectorizer()
        .setInputCol("value")
        .setOutputCol("vector")
        .setBinary(true);

vectorizer.fit(df).transform(df).show();

Result:

+----+---------+--------------------+
| key|    value|              vector|
+----+---------+--------------------+
|key0|   [a, d]| (5,[0,1],[1.0,1.0])|
|key1|      [c]|       (5,[2],[1.0])|
|key2|[b, d, e]|(5,[0,3,4],[1.0,1...|
|key3|[a, c, d]|(5,[0,1,2],[1.0,1...|
+----+---------+--------------------+
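The vector column holds Spark ML sparse vectors printed as (size, [indices], [values]). Note that CountVectorizer orders its vocabulary by corpus frequency, not alphabetically, so index 0 here is d (the most frequent element); the fitted CountVectorizerModel exposes the learned ordering via vocabulary(). In Spark itself a Vector can be expanded with toArray(); as a standalone illustration, decoding such a sparse triple into a dense array looks like this (class and method names are hypothetical):

```java
import java.util.Arrays;

public class SparseDecode {
    // Expand a Spark-style sparse vector (size, indices, values)
    // into a dense double array.
    static double[] toDense(int size, int[] indices, double[] values) {
        double[] dense = new double[size];
        for (int i = 0; i < indices.length; i++) {
            dense[indices[i]] = values[i];
        }
        return dense;
    }

    public static void main(String[] args) {
        // (5,[0,1],[1.0,1.0]) -- the key0 row from the table above
        System.out.println(Arrays.toString(
                toDense(5, new int[]{0, 1}, new double[]{1.0, 1.0})));
        // [1.0, 1.0, 0.0, 0.0, 0.0]
    }
}
```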