Apache Spark - worker nodes

Time: 2017-12-05 05:28:21

Tags: apache-spark apache-spark-sql apache-spark-2.0

We receive real-time machine data as JSON from RabbitMQ. Below is a sample of the JSON:



{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:35","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:36","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:37","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:38","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}




The data is windowed for a duration of 'X' minutes, and this is what we want to achieve:

  1. Group the data by deviceId. This is done, but we are not sure whether we can get a Dataset out of it.

  2. We want to loop over the grouped data above and run the aggregation logic for each device using foreachPartition, so that the code executes inside the worker nodes.

  3. Please correct me if my thought process is wrong here.

    Our earlier code collected the data, looped over the RDDs, converted them into Datasets, and applied the aggregation logic on the Datasets using the Spark SqlContext API.

    During load testing we saw that 90% of the processing was happening on the master node; after a while the CPU usage spiked to 100% and the process bombed out.

    So we are now trying to re-engineer the whole process so that as much of the logic as possible executes on the worker nodes.

    Below is the code we have so far that actually runs on the worker nodes, but we still do not have a Dataset to run the aggregation logic on:

    
    
    public static void main(String[] args) {
    		
    		try {
    			
    			mconf = new SparkConf();
    			mconf.setAppName("OnPrem");
    			mconf.setMaster("local[*]");
    			
    			JavaSparkContext sc = new JavaSparkContext(mconf);
    			  
    			jssc = new JavaStreamingContext(sc, Durations.seconds(60));
    
    			SparkSession spksess = SparkSession.builder().appName("Onprem").getOrCreate();
    			//spksess.sparkContext().setLogLevel("ERROR");
    			
    			Map<String, String> rabbitMqConParams = new HashMap<String, String>();
    			rabbitMqConParams.put("hosts", "localhost");
    			rabbitMqConParams.put("userName", "guest");
    			rabbitMqConParams.put("password", "guest");
    			rabbitMqConParams.put("vHost", "/");
    			rabbitMqConParams.put("durable", "true");
    			
    			List<JavaRabbitMQDistributedKey> distributedKeys = new LinkedList<JavaRabbitMQDistributedKey>();
    			distributedKeys.add(new JavaRabbitMQDistributedKey(QUEUE_NAME, new ExchangeAndRouting(EXCHANGE_NAME, "fanout", ""), rabbitMqConParams));
    			
    			Function<Delivery, String> messageHandler = new Function<Delivery, String>() {
    
    				public String call(Delivery message) {
    					return new String(message.getBody());
    				}
    			};
    			
    			JavaInputDStream<String> messages = RabbitMQUtils.createJavaDistributedStream(jssc, String.class, distributedKeys, rabbitMqConParams, messageHandler);
    			
    			JavaDStream<String> machineDataRDD = messages.window(Durations.minutes(2),Durations.seconds(60)); //every 60 seconds one RDD is Created
    			machineDataRDD.print();
    			
    			JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s)); 
    			
    			JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();	
    			
    			groupedData.foreachRDD(new VoidFunction<JavaPairRDD<String,Iterable<String>>>(){
    
    				@Override
    				public void call(JavaPairRDD<String, Iterable<String>> data) throws Exception {
    					
    					data.foreachPartition(new VoidFunction<Iterator<Tuple2<String,Iterable<String>>>>(){
    
    						@Override
    						public void call(Iterator<Tuple2<String, Iterable<String>>> data) throws Exception {
    						 
    							 while(data.hasNext()){
    								 LOGGER.error("Machine Data == >>"+data.next());
    							 }
    						}
    						
    					});
    					 
    				}
    			
    			});
    			jssc.start();
    			jssc.awaitTermination();
    			
    		}
    		catch (Exception e) 
    		{
    			e.printStackTrace();
    		}
    }
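
    For reference, getMap just turns one incoming JSON message into a Map of its fields. A minimal sketch of such a helper, assuming Jackson is on the classpath (the real implementation may differ), would be:

    // Illustrative sketch of the getMap helper (the actual implementation may differ):
    // parse one JSON message into a Map using Jackson.
    // Needs com.fasterxml.jackson.databind.ObjectMapper,
    // com.fasterxml.jackson.core.type.TypeReference and java.io.IOException.
    private static Map<String, Object> getMap(String json) {
        try {
            ObjectMapper mapper = new ObjectMapper();
            return mapper.readValue(json, new TypeReference<Map<String, Object>>() {});
        } catch (IOException e) {
            throw new RuntimeException("Unable to parse message: " + json, e);
        }
    }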

    The grouping code below gives us an Iterable of strings for each device; ideally we would like to get a Dataset instead.

    JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s));
    JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();

    What is important to me is that the loop is done with foreachPartition, so that the executing code gets pushed down to the worker nodes.
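
    As a sketch of what that could look like, the per-device aggregation could be done with plain Java inside foreachPartition (data below is the JavaPairRDD received in the foreachRDD callback above), so nothing that depends on a SparkSession is needed on the executors. The average-speed calculation is only a placeholder for the real aggregation logic:

    // Sketch only: aggregate each device's windowed messages with plain Java inside
    // foreachPartition; no SparkSession/SQLContext is needed on the executor.
    // The "average speed" metric is just a placeholder for the real aggregation.
    data.foreachPartition(partition -> {
        while (partition.hasNext()) {
            Tuple2<String, Iterable<String>> device = partition.next();
            String deviceId = device._1();
            long count = 0;
            double speedSum = 0;
            for (String json : device._2()) {
                Map<String, Object> msg = getMap(json);
                Map<String, Object> payload = (Map<String, Object>) msg.get("data");
                speedSum += ((Number) payload.get("speed")).doubleValue();
                count++;
            }
            double avgSpeed = count == 0 ? 0 : speedSum / count;
            LOGGER.error("Device " + deviceId + " avg speed over window = " + avgSpeed);
        }
    });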

1 answer:

Answer 0 (score: 0)

After going through more code samples and guides: the SQLContext / SparkSession is not serialized and available on the worker nodes, so we are going to change our strategy and stop trying to build a Dataset inside a foreachPartition loop.
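
A sketch of what that revised approach could look like, reusing machineDataRDD and spksess from the question's code (the avg of data.speed is just an illustration of the aggregation):

    // Sketch only: the Dataset is built on the driver inside foreachRDD, where the
    // SparkSession is available; the groupBy/agg itself still executes on the workers.
    machineDataRDD.foreachRDD(rdd -> {
        if (rdd.isEmpty()) {
            return;
        }
        Dataset<Row> machines = spksess.read().json(rdd);    // parse the windowed JSON batch
        Dataset<Row> perDevice = machines.groupBy("DeviceId")
                .agg(functions.avg("data.speed").alias("avgSpeed"));   // placeholder aggregation
        perDevice.show();
    });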