ArrayList is empty after JavaRDD&lt;String&gt;.foreach in Spark

Asked: 2017-04-24 15:46:20

Tags: java apache-spark java-7 anonymous-function

Sample JSON (100 records in total):

    {"name":"dev","salary":10000,"occupation":"ENGG","address":"noida"}
    {"name":"KARTHIK","salary":20000,"occupation":"ENGG","address":"noida"}

The code:

   final List<Map<String,String>> jsonData = new ArrayList<>();

   DataFrame df =  sqlContext.read().json("file:///home/dev/data-json/emp.json");
   JavaRDD<String> rdd = df.repartition(1).toJSON().toJavaRDD(); 

   rdd.foreach(new VoidFunction<String>() {
       @Override
       public void call(String line)  {
           try {
               jsonData.add(new ObjectMapper().readValue(line, Map.class));
               System.out.println(Thread.currentThread().getName());
               System.out.println("List size: "+jsonData.size());
           } catch (IOException e) {
               e.printStackTrace();
           }
       }
   });

   System.out.println(Thread.currentThread().getName());
   System.out.println("List size: "+jsonData.size());

jsonData ends up empty.

Output:

Executor task launch worker-1
List size: 1
Executor task launch worker-1
List size: 2
Executor task launch worker-1
List size: 3
.
.
.
Executor task launch worker-1
List size: 100

main
List size: 0
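The output is the expected behavior: Spark serializes the `foreach` closure (together with the `jsonData` reference it captures) and ships it to the executors, so each executor mutates a *deserialized copy* of the list, never the driver's original. A minimal sketch of the same effect with a plain Java serialization round-trip, no Spark involved (class and method names here are made up for illustration):

```java
import java.io.*;
import java.util.*;

public class ClosureCopyDemo {
    // Stands in for a Spark closure that captures a driver-side list.
    static class Task implements Serializable {
        final List<String> data = new ArrayList<>();
    }

    // Serialize and deserialize, like Spark does when shipping a closure
    // from the driver to an executor.
    static Task roundTrip(Task t) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(t);
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            return (Task) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Task driverSide = new Task();
        Task executorSide = roundTrip(driverSide); // the copy the executor sees
        executorSide.data.add("record");           // mutation happens on the copy
        System.out.println("executor copy size: " + executorSide.data.size()); // 1
        System.out.println("driver list size: " + driverSide.data.size());     // 0
    }
}
```

The executor's copy grows to 100 while the driver's list stays at 0, which is exactly the output shown above.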

1 Answer:

Answer (score: 1)

I have tested this, and it works: https://github.com/freedev/spark-test

// The ObjectMapper is captured by the closure below; Jackson's ObjectMapper
// is Serializable, so Spark can ship it to the executors.
final ObjectMapper objectMapper = new ObjectMapper();

List<Map<String, Object>> list = rdd
        .map(new org.apache.spark.api.java.function.Function<String, Map<String, Object>>() {
            @Override
            public Map<String, Object> call(String line) throws Exception {
                TypeReference<Map<String, Object>> typeRef = new TypeReference<Map<String, Object>>() {
                };
                Map<String, Object> rs = objectMapper.readValue(line, typeRef);
                return rs;
            }
        }).collect();

I prefer mapping to Map&lt;String, Object&gt;, because this handles the cases where a JSON value is not a string (e.g. "salary":20000).
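The key difference in the answer is the pattern, not the parsing: `map(...).collect()` computes results on the executors and *returns* them to the driver, instead of mutating driver-side state from inside `foreach`. The same shape in plain Java streams (an analogy only, not Spark; the values here are hypothetical):

```java
import java.util.*;
import java.util.stream.*;

public class MapCollectDemo {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "{\"name\":\"dev\"}", "{\"name\":\"KARTHIK\"}");

        // Transform each record and collect the results back,
        // like rdd.map(...).collect() does across the cluster.
        List<Integer> lengths = lines.stream()
                .map(String::length)
                .collect(Collectors.toList());

        System.out.println("collected: " + lengths.size()); // 2
    }
}
```

With Spark the `.collect()` call is what brings the per-record results back to the driver JVM, so the driver-side list is populated by return value rather than by side effect.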