Why is the JaccardDistance from Spark MinHashLSHModel approxSimilarityJoin always 0 for different documents?

Asked: 2019-10-31 17:46:13

Tags: apache-spark machine-learning minhash

I am new to Spark ML. Spark ML has a MinHash implementation for Jaccard distance; see the documentation at https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance. In the example code there, the input data for the comparison are vectors, and I have no questions about that example. However, when I use text documents as input and convert them to vectors with Word2Vec, I always get a Jaccard distance of 0. I don't know what is wrong with my code, or what I am misunderstanding. Thanks in advance for any help.

import static org.apache.spark.sql.functions.col;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.MinHashLSH;
import org.apache.spark.ml.feature.MinHashLSHModel;
import org.apache.spark.ml.feature.Word2Vec;
import org.apache.spark.ml.feature.Word2VecModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

SparkSession spark = SparkSession.builder().appName("TestMinHashLSH").config("spark.master", "local").getOrCreate();

List<Row> data1 = Arrays.asList(RowFactory.create(Arrays.asList("Hi I heard about Spark".split(" "))),
            RowFactory.create(Arrays.asList("I wish Java could use case classes".split(" "))),
            RowFactory.create(Arrays.asList("Logistic regression models are neat".split(" "))));

List<Row> data2 = Arrays.asList(RowFactory.create(Arrays.asList("Hi I heard about Scala".split(" "))),
            RowFactory.create(Arrays.asList("I wish python could also use case classes".split(" "))));

StructType schema4word = new StructType(new StructField[] {
            new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty()) });
Dataset<Row> documentDF1 = spark.createDataFrame(data1, schema4word);

// Learn a mapping from words to Vectors.
Word2Vec word2Vec = new Word2Vec().setInputCol("text").setOutputCol("result").setVectorSize(30).setMinCount(0);

Word2VecModel w2vModel1 = word2Vec.fit(documentDF1);
Dataset<Row> result1 = w2vModel1.transform(documentDF1);

List<Row> myDataList1 = new ArrayList<>();      
int id = 0;
for (Row row : result1.collectAsList()) {
    List<String> text = row.getList(0);
    Vector vector = (Vector) row.get(1);
    myDataList1.add(RowFactory.create(id++, vector));
}
StructType schema1 = new StructType(
        new StructField[] { new StructField("id", DataTypes.IntegerType, false, Metadata.empty()), new StructField("features", new VectorUDT(), false, Metadata.empty()) });

Dataset<Row> df1 = spark.createDataFrame(myDataList1, schema1);

Dataset<Row> documentDF2 = spark.createDataFrame(data2, schema4word);

Word2VecModel w2vModel2 = word2Vec.fit(documentDF2);
Dataset<Row> result2 = w2vModel2.transform(documentDF2);

List<Row> myDataList2 = new ArrayList<>();      
id = 10;
for (Row row : result2.collectAsList()) {
    List<String> text = row.getList(0);
    Vector vector = (Vector) row.get(1);
    System.out.println("Text: " + text + " => \nVector: " + vector + "\n");
    myDataList2.add(RowFactory.create(id++, vector));
}

Dataset<Row> df2 = spark.createDataFrame(myDataList2, schema1);

MinHashLSH mh = new MinHashLSH().setNumHashTables(5).setInputCol("features").setOutputCol("hashes");

MinHashLSHModel model = mh.fit(df1);

// Feature Transformation
System.out.println("The hashed dataset where hashed values are stored in the column 'hashes':");
model.transform(df1).show();

// Compute the locality sensitive hashes for the input rows, then perform
// approximate similarity join.
// We could avoid computing hashes by passing in the already-transformed
// dataset, e.g. `model.approxSimilarityJoin(transformedA, transformedB, 0.6)`
System.out.println("Approximately joining df1 and df2 on Jaccard distance smaller than 0.6:");
model.approxSimilarityJoin(df1, df2, 1.6, "JaccardDistance")
        .select(col("datasetA.id").alias("id1"), col("datasetB.id").alias("id2"), col("JaccardDistance"))
        .show();


spark.stop();

From Word2Vec I get different vectors for different documents. When comparing two different documents, I expected to get some non-zero JaccardDistance values, but instead I got all 0s. Below is what I get when I run the program:

Text: [Hi, I, heard, about, Scala] =>
Vector: [0.005808539432473481,-0.001387741044163704,0.007890049391426146,...,04969391227]

Text: [I, wish, python, could, also, use, case, classes] =>
Vector: [-0.0022146602132124826,0.0032128597667906906433,-0.00658524181926623,...,-3.716901264851913E-4]

Approximately joining df1 and df2 on Jaccard distance smaller than 0.6:
+---+---+---------------+
|id1|id2|JaccardDistance|
+---+---+---------------+
|  1| 11|            0.0|
|  0| 10|            0.0|
|  2| 11|            0.0|
|  0| 11|            0.0|
|  1| 10|            0.0|
|  2| 10|            0.0|
+---+---+---------------+

1 Answer:

Answer 0 (score: 0):

Jaccard similarity, both as defined and as implemented in Spark, is a measure between two sets.

From the Spark documentation:

The Jaccard distance of two sets is defined by the cardinality of their intersection and union:

d(A,B) = 1 − |A∩B| / |A∪B|
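The formula is easy to check by hand on the token sets of two of the example sentences. Here is a minimal plain-Java sketch (class and method names are mine, not Spark API):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative implementation of the Jaccard distance formula quoted above,
// applied to the token sets of two of the question's example sentences.
public class JaccardDemo {
    static double jaccardDistance(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return 1.0 - (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> d1 = new HashSet<>(Arrays.asList("Hi I heard about Spark".split(" ")));
        Set<String> d2 = new HashSet<>(Arrays.asList("Hi I heard about Scala".split(" ")));
        // |A ∩ B| = 4 ({Hi, I, heard, about}), |A ∪ B| = 6, so d = 1 - 4/6 ≈ 0.333
        System.out.println(jaccardDistance(d1, d2));
    }
}
```

So for genuinely different token sets, a non-zero distance is exactly what one would expect.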

So when you apply word2vec to a document, it is mapped into a vector space (an embedding) that captures the semantics of the text. The resulting vectors are dense: in your example essentially every component is a small non-zero real number. That is a problem for MinHash with Jaccard distance, because MinHash treats its input as a set — only the indices of the non-zero entries matter. Since every index is non-zero in every Word2Vec vector, all of your documents map to the same set, and the Jaccard distance between them is 0. If you still want to use word2vec, pick a distance suited to dense vectors, such as cosine distance.
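This collapse is easy to reproduce without Spark. The sketch below is illustrative only (not Spark's actual implementation): it shows that two different dense embeddings have exactly the same set of non-zero indices, which is all a set-based measure like MinHash/Jaccard ever looks at.

```java
import java.util.HashSet;
import java.util.Set;

// Not Spark's code -- just a sketch of how MinHashLSH "sees" its input:
// only the set of non-zero indices of the vector matters.
public class DenseVectorProblem {
    static Set<Integer> nonZeroIndices(double[] v) {
        Set<Integer> s = new HashSet<>();
        for (int i = 0; i < v.length; i++) {
            if (v[i] != 0.0) s.add(i);
        }
        return s;
    }

    public static void main(String[] args) {
        // Two different dense embeddings, like those produced by Word2Vec:
        double[] v1 = {0.0058, -0.0013, 0.0078};
        double[] v2 = {-0.0022, 0.0032, -0.0065};
        // Both collapse to the same set {0, 1, 2}, so Jaccard distance is 0.
        System.out.println(nonZeroIndices(v1).equals(nonZeroIndices(v2))); // prints true
    }
}
```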

The correct preprocessing for Jaccard distance would be:

  1. CountVectorizer
  2. or hash the tokens yourself and assemble the result into a vector
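A minimal plain-Java sketch of option 2 (the class name and the 1024-slot width are my choices, not Spark API): hash each token into a fixed-width 0/1 presence vector, the kind of sparse binary input MinHashLSH expects. In a Spark pipeline, CountVectorizer achieves the same effect.

```java
import java.util.Arrays;

// Illustrative only: hash tokens into a fixed-width binary presence vector.
public class TokenHasher {
    static double[] hashTokens(String[] tokens, int dim) {
        double[] v = new double[dim];
        for (String t : tokens) {
            int idx = Math.floorMod(t.hashCode(), dim);
            v[idx] = 1.0; // presence, not count: MinHash only cares about non-zero
        }
        return v;
    }

    public static void main(String[] args) {
        double[] v1 = hashTokens("Hi I heard about Spark".split(" "), 1024);
        double[] v2 = hashTokens("Hi I heard about Scala".split(" "), 1024);
        // The vectors now differ in the slots for "Spark" and "Scala"
        // (barring hash collisions), so their Jaccard distance is non-zero.
        System.out.println(Arrays.equals(v1, v2)); // prints false
    }
}
```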

MinHash expects binary vectors: any non-zero value is treated as a binary "1".
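For intuition, here is a toy MinHash in plain Java — an illustrative hash family, not the one Spark uses. Each signature slot is the minimum of one random affine hash over the set's element indices, and the fraction of matching slots between two signatures estimates Jaccard similarity; this is why the input must encode a set of indices rather than dense real values.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Toy MinHash over sets of non-negative integer indices (illustrative only).
public class MinHashSketch {
    static final long PRIME = 2038074743L; // a prime larger than any element index
    final long[] a, b;

    MinHashSketch(int numHashes, long seed) {
        Random rnd = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rnd.nextInt(Integer.MAX_VALUE); // random affine hash h(x) = (a*x + b) mod p
            b[i] = rnd.nextInt(Integer.MAX_VALUE);
        }
    }

    long[] signature(Set<Integer> elements) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int x : elements) {
            for (int i = 0; i < a.length; i++) {
                sig[i] = Math.min(sig[i], (a[i] * x + b[i]) % PRIME);
            }
        }
        return sig;
    }

    static double estimatedSimilarity(long[] s1, long[] s2) {
        int matches = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) matches++;
        }
        return (double) matches / s1.length;
    }

    public static void main(String[] args) {
        MinHashSketch mh = new MinHashSketch(5, 42L);
        Set<Integer> s = new HashSet<>(Arrays.asList(1, 5, 9));
        // Identical sets always produce identical signatures: similarity 1.0.
        System.out.println(estimatedSimilarity(mh.signature(s), mh.signature(s)));
    }
}
```

With only 5 hash tables (as in the question) the estimate is coarse; more hash tables tighten it at the cost of more computation.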

For a working example, see this post from Uber: https://eng.uber.com/lsh/