I am trying to iterate over an RDD, apply some logic to each row, and then send it to an API. But execution never reaches the inside of the while loop over the RDD's rows.
if (dataFrame.toJSON().toJavaRDD().take(1).size() > 0) {
    System.out.println("jsonString:#######");
    // System.out.println(dataFrame.toJSON().toJavaRDD().take(1));
    // VoidFunction is org.apache.spark.api.java.function.VoidFunction
    dataFrame.toJSON().toJavaRDD().foreachPartition(new VoidFunction<Iterator<String>>() {
        private static final long serialVersionUID = 1L;

        @Override
        public void call(Iterator<String> jsonString) throws Exception {
            System.out.println("#######");
            while (jsonString.hasNext()) {
                final String str = jsonString.next();
                if (str != null && !str.equals("")) {
                    // print the element, not the iterator object itself
                    System.out.println("jsonString:" + str);
                }
            }
        }
    });
}
Answer 0 (score: 2)
Just in case, here is the program I used to test this case:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class StackOverflow20190326_2 {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("StackOverflow20190326")
                .master("local")
                .getOrCreate();

        // generate a dummy 2-liner dataset
        Dataset<Row> ds = spark.sql(
                "select 1 as idx, 'this is line 1' as value union select 2 as idx, 'This is the second line' as value");

        test(ds);

        spark.stop();
    }

    private static void test(Dataset<Row> dataFrame) {
        JavaRDD<String> javaRDD = dataFrame.toJSON().toJavaRDD();
        if (javaRDD.take(1).size() > 0) {
            System.out.println("jsonString:#######");
            javaRDD.foreachPartition(jsonString -> {
                System.out.println("#######" + jsonString);
                while (jsonString.hasNext()) {
                    final String str = jsonString.next();
                    if (str != null && !str.equals("")) {
                        System.out.println("jsonString:" + str);
                    }
                }
            });
        }
    }
}
The output looks like this:
jsonString:#######
#######IteratorWrapper(empty iterator)
#######IteratorWrapper(empty iterator)
#######IteratorWrapper(empty iterator)
#######IteratorWrapper(empty iterator)
#######IteratorWrapper(empty iterator)
#######IteratorWrapper(empty iterator)
#######IteratorWrapper(empty iterator)
#######IteratorWrapper(empty iterator)
#######IteratorWrapper(empty iterator)
#######IteratorWrapper(empty iterator)
#######IteratorWrapper(empty iterator)
#######IteratorWrapper(empty iterator)
#######IteratorWrapper(empty iterator)
#######IteratorWrapper(empty iterator)
#######IteratorWrapper(empty iterator)
#######IteratorWrapper(empty iterator)
#######IteratorWrapper(non-empty iterator)
jsonString:{"idx":1,"value":"this is line 1"}
#######IteratorWrapper(non-empty iterator)
jsonString:{"idx":2,"value":"This is the second line"}
As you can see, there are a lot of empty partitions, but the initial two rows are output just fine.
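If the empty partitions are just noise in the logs, one way to silence them is to shrink the RDD to fewer partitions before iterating. A minimal sketch along the lines of the test() method above; the choice of coalesce(1) is purely illustrative and not part of the original answer:

// Hypothetical variant: coalesce to a single partition so that
// foreachPartition sees one non-empty iterator instead of many empty ones.
// coalesce() is a standard JavaRDD method.
JavaRDD<String> compacted = dataFrame.toJSON().toJavaRDD().coalesce(1);
compacted.foreachPartition(jsonString -> {
    while (jsonString.hasNext()) {
        System.out.println("jsonString:" + jsonString.next());
    }
});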
As you can see from my maven pom.xml, I am using Spark 2.4:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.borgoltz.test</groupId>
    <artifactId>spark-client</artifactId>
    <version>0.0.1-SNAPSHOT</version>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>

    <parent>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-parent_2.12</artifactId>
        <version>2.4.0</version>
    </parent>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
    </dependencies>
</project>
Last but not least: are you running in local mode? If not, the closures passed to the .foreachPartition() call may be invoked on remote executors, so the println output will appear on machines other than the one running the driver. An easy way to verify is to check the logs on the executors, or to replace the System.out.println with something that writes to HDFS, for example...
HTH!
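For illustration, a minimal sketch of that verification idea; the HDFS output path is a placeholder, and tagging rows with the hostname is just one way to see which machine processed each row (neither is from the original answer):

// Hypothetical verification sketch: prefix each row with the hostname of
// the JVM that processed it, then persist the result instead of printing.
// map() and saveAsTextFile() are standard JavaRDD methods.
JavaRDD<String> tagged = dataFrame.toJSON().toJavaRDD().map(str ->
        java.net.InetAddress.getLocalHost().getHostName() + " -> " + str);
tagged.saveAsTextFile("hdfs:///tmp/foreach-partition-debug");

Inspecting the resulting part files then shows which executor host handled each row, without relying on where stdout ends up.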
Answer 1 (score: 0)
This worked for me:
// JSONObject is org.json.JSONObject; takeAsList() is the Java-friendly
// variant of take(), whose Array[T] return type erases to Object in Java
if (dataFrame.takeAsList(1).size() > 0) {
    // collect() pulls the whole RDD back to the driver, so this only
    // suits data small enough to fit in driver memory
    Iterator<String> itt = dataFrame.toJSON().toJavaRDD().collect().iterator();
    while (itt.hasNext()) {
        String field = itt.next();
        JSONObject jsonResponse = new JSONObject(field);
        System.out.println("jsonString:" + jsonResponse);
    }
}