StackOverflowError while using distinct in Apache Spark

Asked: 2017-05-12 10:14:44

Tags: java apache-spark rdd apache-spark-2.0

I am using Spark 2.0.1.

I am trying to find the distinct values in a JavaRDD, as shown below:

JavaRDD<String> distinct_installedApp_Ids = filteredInstalledApp_Ids.distinct();

This line throws the following exception:

Exception in thread "main" java.lang.StackOverflowError
    at org.apache.spark.rdd.RDD.checkpointRDD(RDD.scala:226)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
    at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:84)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
    at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
   ..........

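The same stack trace repeats over and over. The input filteredInstalledApp_Ids is large, with millions of records. Is the problem the number of records, or is there a more efficient way to find the distinct values in a JavaRDD? Any help would be much appreciated. Thanks in advance. Cheers.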

Edit 1:

Adding the filter method:

JavaRDD<String> filteredInstalledApp_Ids = installedApp_Ids
        .filter(new Function<String, Boolean>() {
            @Override
            public Boolean call(String v1) throws Exception {
                return v1 != null;
            }
        }).cache();

Edit 2:

Adding the method used to generate installedApp_Ids:

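// Parses one input file and unions the result with the RDD passed in as
// installedApp_Ids (if any).
public JavaRDD<String> getIdsWithInstalledApps(String inputPath, JavaSparkContext sc,
        JavaRDD<String> installedApp_Ids) {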

    JavaRDD<String> appIdsRDD = sc.textFile(inputPath);
    try {
        JavaRDD<String> appIdsRDD1 = appIdsRDD.map(new Function<String, String>() {
            @Override
            public String call(String t) throws Exception {
                String delimiter = "\t";
                String[] id_Type = t.split(delimiter);
                StringBuilder temp = new StringBuilder(id_Type[1]);
                if ((temp.indexOf("\"")) != -1) {
                    String escaped = temp.toString().replace("\\", "");
                    escaped = escaped.replace("\"{", "{");
                    escaped = escaped.replace("}\"", "}");
                    temp = new StringBuilder(escaped);
                }
                // Parse the unescaped payload and read eventData.appType.
                JSONObject wholeventObj = new JSONObject(temp.toString());
                JSONObject eventJsonObj = wholeventObj.getJSONObject("eventData");
                int appType = eventJsonObj.getInt("appType");
                if (appType == 1) {
                    // Note: this returns the app type itself (always "1" here).
                    try {
                        return (String.valueOf(appType));
                    } catch (JSONException e) {
                        return null;
                    }
                }
                return null;
            }
        }).cache();
        if (installedApp_Ids != null)
            return sc.union(installedApp_Ids, appIdsRDD1);
        else
            return appIdsRDD1;
    } catch (Exception e) {
        e.printStackTrace();
    }
    return null;
}
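
One possible reading of the trace, given the method above: the repeating UnionRDD.getPartitions / RDD.partitions frames look like a walk through nested unions rather than a symptom of the record count. If getIdsWithInstalledApps is called once per input path with the previous result passed back in (an assumption, since the caller is not shown), each call wraps the earlier union in a new one. The sketch below is a plain-Java model of that effect, not Spark code; Node, chainedUnion and flatUnion are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UnionLineageSketch {

    /** Hypothetical stand-in for an RDD: it only remembers its parents. */
    static final class Node {
        final List<Node> parents;

        Node(List<Node> parents) {
            this.parents = parents;
        }

        /** Lineage depth, computed by recursing into the parents. */
        int depth() {
            if (parents.isEmpty()) {
                return 0;
            }
            int max = 0;
            for (Node p : parents) {
                max = Math.max(max, p.depth());
            }
            return max + 1;
        }
    }

    static Node leaf() {
        return new Node(new ArrayList<>());
    }

    /** Pairwise union in a loop: every new union wraps the previous one. */
    static Node chainedUnion(int inputs) {
        Node acc = null;
        for (int i = 0; i < inputs; i++) {
            Node next = leaf();
            acc = (acc == null) ? next : new Node(Arrays.asList(acc, next));
        }
        return acc;
    }

    /** One union over all inputs: every input is a direct parent. */
    static Node flatUnion(int inputs) {
        List<Node> all = new ArrayList<>();
        for (int i = 0; i < inputs; i++) {
            all.add(leaf());
        }
        return new Node(all);
    }

    public static void main(String[] args) {
        System.out.println(chainedUnion(1000).depth()); // prints 999
        System.out.println(flatUnion(1000).depth());    // prints 1
    }
}
```

In the model, the chained variant needs a recursion as deep as the number of inputs, while the flat variant stays at depth 1. In Spark terms, the flat variant would correspond to collecting the per-path RDDs in a java.util.List and calling sc.union once, or to reading all inputs with a single sc.textFile call, which accepts a comma-separated list of paths. Whether this applies here depends on how getIdsWithInstalledApps is invoked, which the post does not show.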

0 answers:

No answers yet.