LDA in Spark 1.3.1. How do I convert raw data into a Term Document Matrix?

Asked: 2015-11-11 21:30:08

Tags: java apache-spark lda

I am trying to run LDA with Spark 1.3.1 in Java and am getting this error:

Error: application failed with exception
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NumberFormatException: For input string: "��"

My .txt file looks like this (one document per line):

now, weight find difficult pull up
blindness disease everything eyes work perfectly except ability take light use light form images
role model children
dear memories sad childhood memories

Here is the code:

import scala.Tuple2;

import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.clustering.LDAModel;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.SparkConf;

public class JavaLDA {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("LDA Example");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Load and parse the data
    String path = "/tutorial/input/askreddit20150801.txt";
    JavaRDD<String> data = sc.textFile(path);
    JavaRDD<Vector> parsedData = data.map(
        new Function<String, Vector>() {
          public Vector call(String s) {
            String[] sarray = s.trim().split(" ");
            double[] values = new double[sarray.length];
            for (int i = 0; i < sarray.length; i++)
              values[i] = Double.parseDouble(sarray[i]);
            return Vectors.dense(values);
          }
        }
    );
    // Index documents with unique IDs
    JavaPairRDD<Long, Vector> corpus = JavaPairRDD.fromJavaRDD(parsedData.zipWithIndex().map(
        new Function<Tuple2<Vector, Long>, Tuple2<Long, Vector>>() {
          public Tuple2<Long, Vector> call(Tuple2<Vector, Long> doc_id) {
            return doc_id.swap();
          }
        }
    ));
    corpus.cache();

    // Cluster the documents into 100 topics using LDA
    LDAModel ldaModel = new LDA().setK(100).run(corpus);

    // Output topics. Each is a distribution over words (matching word count vectors)
    System.out.println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize()
        + " words):");
    Matrix topics = ldaModel.topicsMatrix();
    for (int topic = 0; topic < 100; topic++) {
      System.out.print("Topic " + topic + ":");
      for (int word = 0; word < ldaModel.vocabSize(); word++) {
        System.out.print(" " + topics.apply(word, topic));
      }
      System.out.println();
    }

    ldaModel.save(sc.sc(), "myLDAModel");

  }
}

Does anyone know why this is happening? This is my first attempt at LDA with Spark. Thanks.

2 answers:

Answer 0 (score: 0)

values[i] = Double.parseDouble(sarray[i]);

Why are you trying to convert every word of the text file into a Double?

This is the answer to your problem: http://docs.oracle.com/javase/6/docs/api/java/lang/Double.html#parseDouble%28java.lang.String%29
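A minimal sketch of why the parse fails (the class name is made up for this demo; the token "blindness" stands in for any word from the file):

public class ParseDoubleDemo {
  public static void main(String[] args) {
    // Works: the token is a valid numeric literal.
    System.out.println(Double.parseDouble("3.14"));

    // Throws java.lang.NumberFormatException, which is exactly what happens
    // to every word (and every non-ASCII character) in the .txt file.
    System.out.println(Double.parseDouble("blindness"));
  }
}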

Answer 1 (score: 0)

Your code expects the input file to be lines of whitespace-separated text that look like numbers. Assuming your text is actually words:

Get a list of every word that appears in your corpus:

JavaRDD<String> words =
        data.flatMap((FlatMapFunction<String, String>) s -> {
            s = s.replaceAll("[^a-zA-Z ]", "");
            s = s.toLowerCase();
            return Arrays.asList(s.split(" "));
        });

Build a map that assigns each word an integer index:

// distinct() gives each word exactly one compact index in [0, vocab.size());
// without it, zipWithIndex numbers every occurrence, and the resulting
// indices can be far larger than the vocabulary size.
Map<String, Long> vocab = words.distinct().zipWithIndex().collectAsMap();
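As a quick sanity check, here is a hypothetical snippet (assuming the same JavaSparkContext sc as in the question, plus imports of java.util.Arrays, java.util.Map, and org.apache.spark.api.java.function.FlatMapFunction) that shows what vocab ends up holding for a tiny in-memory corpus:

// Two toy documents.
JavaRDD<String> tiny = sc.parallelize(Arrays.asList(
    "dear memories sad childhood memories",
    "role model children"));

// Same clean-up and split as above; Spark 1.x FlatMapFunction returns an Iterable.
JavaRDD<String> tinyWords = tiny.flatMap(
    (FlatMapFunction<String, String>) s -> Arrays.asList(s.toLowerCase().split(" ")));

// Seven distinct words mapped to indices 0..6, e.g. {dear=0, memories=1, sad=2, ...}
// (the exact assignment depends on partition order).
Map<String, Long> tinyVocab = tinyWords.distinct().zipWithIndex().collectAsMap();
System.out.println(tinyVocab);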

Then, instead of having parsedData do what it currently does, make it look up each word, find the associated index, go to that position in the array, and add 1 to the count for that word:

JavaRDD<Vector> tokens = data.map(
        (Function<String, Vector>) s -> {
            // Apply the same clean-up as above so every token is found in vocab.
            s = s.replaceAll("[^a-zA-Z ]", "");
            s = s.toLowerCase();
            String[] vals = s.split("\\s");
            double[] idx = new double[vocab.size() + 1];
            for (String val : vals) {
                // Increment the count at the index assigned to this word.
                idx[vocab.get(val).intValue()] += 1.0;
            }
            return Vectors.dense(idx);
        }
    );

This gives you an RDD of vectors, where each vector is vocab.size() long and each entry is the number of times that vocabulary word appears on that line.
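For completeness, here is an untested sketch (reusing the question's imports and the tokens RDD above) of how these count vectors could replace parsedData in the original pipeline:

// Index documents with unique IDs, exactly as in the question's code.
JavaPairRDD<Long, Vector> corpus = JavaPairRDD.fromJavaRDD(tokens.zipWithIndex().map(
    (Function<Tuple2<Vector, Long>, Tuple2<Long, Vector>>) doc -> doc.swap()));
corpus.cache();

// Run LDA on the word-count vectors (K = 100, as in the question).
LDAModel ldaModel = new LDA().setK(100).run(corpus);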

I adapted this code slightly from what I'm currently using and haven't tested it, so there may be errors. Good luck!