How do I correctly build TF-IDF vectors for sentences in Apache Spark using Java?

Asked: 2017-01-19 22:53:12

Tags: java apache-spark apache-spark-mllib tf-idf

I have this code:

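import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.feature.IDF;
import org.apache.spark.mllib.feature.IDFModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.sql.SparkSession;

// SparkSingleton and KMeansProcessor are my own helper classes (not shown here).
public class TfIdfExample {
    public static void main(String[] args) {
        JavaSparkContext sc = SparkSingleton.getContext();
        SparkSession spark = SparkSession.builder()
                .config("spark.sql.warehouse.dir", "spark-warehouse")
                .getOrCreate();

        // Three tokenized sentences, spread over 2 partitions
        JavaRDD<List<String>> documents = sc.parallelize(Arrays.asList(
                Arrays.asList("this is a sentence".split(" ")),
                Arrays.asList("this is another sentence".split(" ")),
                Arrays.asList("this is still a sentence".split(" "))), 2);

        // Term frequencies via the hashing trick (default number of features)
        HashingTF hashingTF = new HashingTF();
        documents.cache();
        JavaRDD<Vector> featurizedData = hashingTF.transform(documents);
        // alternatively, CountVectorizer can also be used to get term frequency vectors

        // Fit the IDF weights on the TF vectors
        IDF idf = new IDF();
        IDFModel idfModel = idf.fit(featurizedData);

        featurizedData.cache();

        // Rescale the TF vectors by the IDF weights
        JavaRDD<Vector> tfidfs = idfModel.transform(featurizedData);
        System.out.println(tfidfs.collect());

        // Cluster the TF-IDF vectors
        KMeansProcessor kMeansProcessor = new KMeansProcessor();
        JavaPairRDD<Vector,Integer> result = kMeansProcessor.Process(tfidfs);
        result.collect().forEach(System.out::println);
    }
}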

I need vectors to feed into k-means, but the vectors I get look odd:

[(1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0]),
     (1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0]),
     (1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0])]

After k-means runs, I get this:

((1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0]),1)
((1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0]),0)
((1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0]),1)
((1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0]),1)
((1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0]),1)
((1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0]),0)
((1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0]),1)
((1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0]),0)
((1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0]),1)

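But I don't think this is working correctly, because TF-IDF output should look different. I assumed mllib had ready-made methods for this, but I tried the example from the documentation and did not get what I need. I have not found a custom solution for Spark either. Has anyone worked with this who can tell me what I am doing wrong? Am I perhaps using the mllib functionality incorrectly?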

1 Answer:

Answer 0: (score: 2)

What you get after TF-IDF is a SparseVector.

To understand the values better, let me start with the TF vectors:

(1048576,[489554,540177,736740,894973],[1.0,1.0,1.0,1.0])
(1048576,[455491,540177,736740,894973],[1.0,1.0,1.0,1.0])
(1048576,[489554,540177,560488,736740,894973],[1.0,1.0,1.0,1.0,1.0])

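For example, the TF vector for the first sentence has 1048576 (= 2^20) components, with 4 non-zero values at indices 489554, 540177, 736740 and 894973. All other values are zero and are therefore not stored in the sparse vector representation.

The dimensionality of the feature vectors equals the number of buckets you hash into: 1048576 = 2^20 buckets in your case.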
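To illustrate why the vector size equals the number of buckets, here is a minimal sketch of the hashing trick in plain Java, without Spark. It rests on assumptions: `HashingTfSketch` and `termFrequencies` are hypothetical names, and `String.hashCode` is only a stand-in for the hash function Spark uses internally, so the bucket indices it prints will not match the ones in the vectors above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HashingTfSketch {

    // Maps each term to a bucket in [0, numBuckets) and counts occurrences per bucket.
    // Only non-empty buckets are stored, which mirrors the sparse (index -> value) layout.
    // NOTE: String.hashCode is an illustrative stand-in, not Spark's actual hash function.
    static Map<Integer, Double> termFrequencies(List<String> terms, int numBuckets) {
        Map<Integer, Double> tf = new TreeMap<>();
        for (String term : terms) {
            int bucket = Math.floorMod(term.hashCode(), numBuckets);
            tf.merge(bucket, 1.0, Double::sum);
        }
        return tf;
    }

    public static void main(String[] args) {
        List<String> sentence = Arrays.asList("this is a sentence".split(" "));

        // 2^20 buckets: the vector has 1048576 components, at most 4 of them non-zero
        int manyBuckets = 1 << 20;
        System.out.println("(" + manyBuckets + "," + termFrequencies(sentence, manyBuckets) + ")");

        // 32 buckets: a much shorter vector, with a higher chance that two terms collide
        int fewBuckets = 32;
        System.out.println("(" + fewBuckets + "," + termFrequencies(sentence, fewBuckets) + ")");
    }
}
```

Whatever the bucket count, the stored counts always add up to the number of terms in the sentence; when two terms collide, their counts are merged into the same bucket.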

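For a corpus of this size, you should consider reducing the number of buckets:

HashingTF hashingTF = new HashingTF(32);

Powers of 2 are recommended to minimize the number of hash collisions.

Next, you apply the IDF weights:

(1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0])
(1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0])
(1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0])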

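If we look at the first sentence again, we get 3 zeros. This is expected: the terms "this", "is" and "sentence" appear in every document of the corpus, so by definition of IDF their weights are zero.

Why are the zero values still in the (sparse) vector? Because in the current implementation the size of the vector is kept the same and only the values are multiplied by IDF.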
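The non-zero weights can be checked by hand. The sketch below assumes the smoothed formula given in the Spark MLlib documentation, idf(t) = ln((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t. `IdfSketch` and `idf` are hypothetical names used only for this illustration. Since every term occurs once per sentence, the TF-IDF value is simply the IDF weight.

```java
public class IdfSketch {

    // Smoothed inverse document frequency: ln((m + 1) / (df + 1)).
    // Assumed to match the formula documented for Spark MLlib's IDF.
    static double idf(long numDocs, long docFreq) {
        return Math.log((numDocs + 1.0) / (docFreq + 1.0));
    }

    public static void main(String[] args) {
        long m = 3; // three sentences in the corpus

        // "this", "is", "sentence" occur in all 3 documents: ln(4/4)
        System.out.println("df=3: " + idf(m, 3));
        // "a" occurs in 2 documents: ln(4/3), roughly 0.2877
        System.out.println("df=2: " + idf(m, 2));
        // "another" and "still" occur in 1 document each: ln(4/2), roughly 0.6931
        System.out.println("df=1: " + idf(m, 1));
    }
}
```

This accounts for the three distinct values in the output: 0.0 for terms shared by all sentences, about 0.2877 for "a", and about 0.6931 for "another" and "still".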