Not able to iterate through javaRDD

Date: 2019-03-26 13:01:34

Tags: java apache-spark

I am trying to iterate over an RDD, apply some logic to each row, and then send it to an API.

However, execution never enters the while loop over the RDD.

if (dataFrame.toJSON().toJavaRDD().take(1).size() > 0) {

    System.out.println("jsonString:#######");

    // System.out.println(dataFrame.toJSON().toJavaRDD().take(1));

    dataFrame.toJSON().toJavaRDD().foreachPartition(new VoidFunction<Iterator<String>>() {
        private static final long serialVersionUID = 1L;

        @Override
        public void call(Iterator<String> jsonString) throws Exception {
            System.out.println("#######");

            while (jsonString.hasNext()) {
                final String str = jsonString.next();
                if (str != null && !str.equals("")) {
                    System.out.println("jsonString:" + jsonString);
                }
            }
        }
    });
}

2 Answers:

Answer 0 (score: 2)

Just in case, here is the program I used to test this case.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class StackOverflow20190326_2 {
    public static void main(String args[]) {
        SparkSession spark = SparkSession.builder().appName("StackOverflow20190326").master("local").getOrCreate();

        // generate a dummy 2-liner dataset
        Dataset<Row> ds = spark.sql("select 1 as idx, 'this is line 1' as value union select 2 as idx, 'This is the second line' as value");

        test(ds);

        spark.stop();

    }

    private static void test(Dataset<Row> dataFrame) {

        JavaRDD<String> javaRDD = dataFrame.toJSON().toJavaRDD();
        if (javaRDD.take(1).size() > 0) {

            System.out.println("jsonString:#######");

            javaRDD.foreachPartition(jsonString -> {
                System.out.println("#######" + jsonString);

                while (jsonString.hasNext()) {
                    final String str = jsonString.next();
                    if (str != null && !str.equals("")) {

                        System.out.println("jsonString:" + str);

                    }

                }

            });
        }
    }
}

The output is as follows:


    jsonString:#######
    #######IteratorWrapper(empty iterator)
    #######IteratorWrapper(empty iterator)
    #######IteratorWrapper(empty iterator)
    #######IteratorWrapper(empty iterator)
    #######IteratorWrapper(empty iterator)
    #######IteratorWrapper(empty iterator)
    #######IteratorWrapper(empty iterator)
    #######IteratorWrapper(empty iterator)
    #######IteratorWrapper(empty iterator)
    #######IteratorWrapper(empty iterator)
    #######IteratorWrapper(empty iterator)
    #######IteratorWrapper(empty iterator)
    #######IteratorWrapper(empty iterator)
    #######IteratorWrapper(empty iterator)
    #######IteratorWrapper(empty iterator)
    #######IteratorWrapper(empty iterator)
    #######IteratorWrapper(non-empty iterator)
    jsonString:{"idx":1,"value":"this is line 1"}
    #######IteratorWrapper(non-empty iterator)
    jsonString:{"idx":2,"value":"This is the second line"}

As you can see, there are a lot of empty partitions, but the original two lines are printed just fine.
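If the many empty partitions get in the way, a minimal sketch (reusing the javaRDD variable from the test program above; the target of 2 partitions is an arbitrary example) is to coalesce before iterating:

// Sketch only: merge the many small/empty partitions before foreachPartition runs
JavaRDD<String> compacted = javaRDD.coalesce(2);
compacted.foreachPartition(jsonString -> {
    while (jsonString.hasNext()) {
        System.out.println("jsonString:" + jsonString.next());
    }
});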

As you can see from my Maven pom.xml, I am using Spark 2.4:

<project xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.borgoltz.test</groupId>
    <artifactId>spark-client</artifactId>
    <version>0.0.1-SNAPSHOT</version>


    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>

    <parent>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-parent_2.12</artifactId>
        <version>2.4.0</version>
    </parent>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
    </dependencies>

</project>

Last but not least:

Are you running in local mode? If not, the closure passed to .foreachPartition() may be invoked on remote executors, so the println output would show up on a different machine than the one running the driver. An easy way to verify this is to check the logs on the executors, or to replace the System.out.println with a write to HDFS, for example...
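For instance, a minimal sketch of that last suggestion, where the output path is only a placeholder you would adapt to your cluster:

// Persist the JSON rows instead of printing them on the executors,
// so the result can be inspected regardless of where the closure runs.
JavaRDD<String> json = dataFrame.toJSON().toJavaRDD();
json.filter(s -> s != null && !s.isEmpty())
    .saveAsTextFile("hdfs:///tmp/json-debug"); // placeholder path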

HTH!

Answer 1 (score: 0)

This worked for me:

if (dataFrame.take(1).length > 0) {
    // collect() pulls the whole RDD back to the driver, then iterate locally
    Iterator<String> itt = dataFrame.toJSON().toJavaRDD().collect().iterator();
    while (itt.hasNext()) {
        String field = itt.next();
        JSONObject jsonResponse = new JSONObject(field); // assumes org.json.JSONObject
        System.out.println("jsonString:" + jsonResponse);
    }
}
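A side note on the snippet above: collect() materializes the whole RDD in driver memory, which is fine for small data sets. If that is a concern, a hedged alternative sketch uses toLocalIterator(), which fetches one partition at a time on the driver:

// Sketch only: same idea, but pulling partitions lazily instead of all at once.
Iterator<String> itt = dataFrame.toJSON().toJavaRDD().toLocalIterator();
while (itt.hasNext()) {
    JSONObject jsonResponse = new JSONObject(itt.next()); // assumes org.json.JSONObject
    System.out.println("jsonString:" + jsonResponse);
}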