Spark Custom Receiver with Spring Boot accessing Cassandra

Date: 2016-09-08 06:47:42

Tags: apache-spark spring-boot

I am trying to write a custom Spark receiver in Java, and inside the receiver I need to access a Cassandra database. I have Spark running in cluster mode with at least 2 workers.

Here is my custom Java Spark receiver -

@Service
public class MyCustomReceiver extends Receiver<MyData>{

    private static final Log logger = LogFactory.getLog(MyCustomReceiver.class);

    @Autowired
    private MyAppConfig myAppConfig;

    @Autowired
    private CassandraDataService cassandraDataService;

    public MyCustomReceiver() {
        super(StorageLevel.MEMORY_AND_DISK_2());
        logger.debug("Initiated...");
    }

    @Override
    public void onStart() {
        // Start the thread that receives data over a connection
        logger.debug("Calling the receive method...");
        receive();
        logger.debug("Done.. calling the receive method...");
    }

    private void receive() {
        logger.debug("receive method called...");

        List<String> myConfigs = myAppConfig.getMyConfig();
        logger.debug("Received myConfigs..." + myConfigs);

        for(String myConfigStr : myConfigs)
        {
            ObjectMapper mapper = new ObjectMapper();
            MyConfig myConfig;

            try {

                while (!isStopped()) {

                    myConfig = mapper.readValue(myConfigStr, MyConfig.class);

                    logger.debug("Parsed the myConfig..." + myConfig);

                    // Check for matching data in Cassandra
                    List<MyData> cassandraRows = cassandraDataService.getMatches(myConfig);

                    for(MyData myData : cassandraRows)
                    {
                        System.out.println("Received data '" + myData + "'");
                    }

                    store(cassandraRows.iterator());
                }

            } catch (IOException e) {
                e.printStackTrace();
            }

        }
    }

    @Override
    public void onStop() {

    }
}
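
As a side note, the Spark Streaming custom receiver guide recommends that onStart() not block and instead launch the receiving loop on a separate thread; in the code above receive() is called inline from onStart(). A minimal sketch of that recommended pattern, reusing the method names from the receiver above (not necessarily related to the problem described below):

    @Override
    public void onStart() {
        // The custom receiver guide expects onStart() to return quickly,
        // so run the receiving loop on its own thread.
        new Thread(this::receive, "my-custom-receiver").start();
    }

    @Override
    public void onStop() {
        // Nothing to clean up here; receive() checks isStopped() and exits on its own.
    }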

The Spark application / driver -

@SpringBootApplication
public class MySpringBootSparkApp {
    private static final Log logger = LogFactory.getLog(MySpringBootSparkApp.class);
    public static void main(String[] args) {
        logger.debug("Initiated MySpringBootSparkApp...");

        SpringApplication.run(MySpringBootSparkApp.class, args);
        SparkConf sparkConf = new SparkConf().setAppName("Spark Processing Boot App");
        JavaStreamingContext jsc = new JavaStreamingContext(sparkConf, new Duration(1000));
        JavaReceiverInputDStream<MyData> myDataDStream = jsc.receiverStream(
                new MyCustomReceiver());
        myDataDStream.foreachRDD(myDataJavaRDD -> {
            logger.debug("myDataJavaRDD = " + myDataJavaRDD);
            myDataJavaRDD.foreach(myData -> {
                System.out.println("myData = " + myData);
            });
        });
    } 
}

When I submit an uber jar containing the above application and all its dependencies to a cluster with at least 2 worker nodes, I can see that one worker picks up the driver program and starts the custom receiver processing. The logs do not show whether anything else happens after that - for example whether the custom receiver is actually invoked, whether data is fetched from Cassandra, or whether any data makes it back to the driver.

Here is the log4j.properties from the Spark conf directory -

log4j.rootCategory=DEBUG, console    
log4j.appender.console=org.apache.log4j.ConsoleAppender    
log4j.appender.console.target=System.err    
log4j.appender.console.layout=org.apache.log4j.PatternLayout    
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n    

# Set the default spark-shell log level to WARN. When running the spark-shell, the    
# log level for this class is used to overwrite the root logger's log level, so that    
# the user can have different defaults for the shell and regular Spark apps.    
log4j.logger.org.apache.spark.repl.Main=DEBUG    

# Settings to quiet third party logs that are too verbose    
log4j.logger.org.spark_project.jetty=WARN    
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR    
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO    
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO    
log4j.logger.org.apache.parquet=ERROR    
log4j.logger.parquet=ERROR    

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support    
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL    
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR

I can't figure out how to tell what is going on in the code above, or why the MyData records I expect the receiver to return are never printed in the Spark driver program - I can't even tell whether they are returned at all. Any guidance on how to proceed would be greatly appreciated.

Thanks

1 Answer:

Answer 0 (score: 0):

I think this was it... I was not calling start() on the JavaStreamingContext:

jsc.start();
jsc.awaitTermination();

Once I did that, the whole Java app started giving me exactly what I was looking for. Cheers!
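
For completeness, here is a minimal sketch of the driver's main method with those two calls added, assuming the same class names as in the question and leaving everything else unchanged:

@SpringBootApplication
public class MySpringBootSparkApp {

    public static void main(String[] args) throws InterruptedException {
        SpringApplication.run(MySpringBootSparkApp.class, args);

        SparkConf sparkConf = new SparkConf().setAppName("Spark Processing Boot App");
        JavaStreamingContext jsc = new JavaStreamingContext(sparkConf, new Duration(1000));

        JavaReceiverInputDStream<MyData> myDataDStream =
                jsc.receiverStream(new MyCustomReceiver());
        myDataDStream.foreachRDD(myDataJavaRDD ->
                myDataJavaRDD.foreach(myData -> System.out.println("myData = " + myData)));

        // Without these two calls the streaming job is only defined, never executed.
        jsc.start();
        jsc.awaitTermination();
    }
}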