We receive real-time machine data as JSON, and we consume it from RabbitMQ. Below is a sample of the JSON:
{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:35","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:36","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:37","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:38","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}

The window duration for the data is 'X' minutes. Here is what we want to achieve:

Group the data by DeviceId. This part is done, but we are not sure whether we can get hold of a Dataset for it.
Loop over the grouped data above and run the aggregation logic for each device using foreachPartition, so that the code executes inside the worker nodes.

Please correct me if my thinking process is wrong.

Our earlier code collected the data, looped over the RDDs, converted them to Datasets, and applied the aggregation logic on the Datasets using the Spark SQLContext API. During load testing we saw that 90% of the processing was happening on the master node; after a while the CPU usage spiked to 100% and the process crashed. So we are now trying to redesign the whole flow so that as much of the logic as possible is executed on the worker nodes.

Below is the code we have so far that actually runs on the worker nodes, but we have not yet managed to get a Dataset for the aggregation logic:
public static void main(String[] args) {
    try {
        mconf = new SparkConf();
        mconf.setAppName("OnPrem");
        mconf.setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(mconf);
        jssc = new JavaStreamingContext(sc, Durations.seconds(60));
        SparkSession spksess = SparkSession.builder().appName("Onprem").getOrCreate();
        //spksess.sparkContext().setLogLevel("ERROR");

        Map<String, String> rabbitMqConParams = new HashMap<String, String>();
        rabbitMqConParams.put("hosts", "localhost");
        rabbitMqConParams.put("userName", "guest");
        rabbitMqConParams.put("password", "guest");
        rabbitMqConParams.put("vHost", "/");
        rabbitMqConParams.put("durable", "true");

        List<JavaRabbitMQDistributedKey> distributedKeys = new LinkedList<JavaRabbitMQDistributedKey>();
        distributedKeys.add(new JavaRabbitMQDistributedKey(QUEUE_NAME,
                new ExchangeAndRouting(EXCHANGE_NAME, "fanout", ""), rabbitMqConParams));

        Function<Delivery, String> messageHandler = new Function<Delivery, String>() {
            public String call(Delivery message) {
                return new String(message.getBody());
            }
        };

        JavaInputDStream<String> messages = RabbitMQUtils.createJavaDistributedStream(
                jssc, String.class, distributedKeys, rabbitMqConParams, messageHandler);

        JavaDStream<String> machineDataRDD = messages.window(Durations.minutes(2), Durations.seconds(60)); // every 60 seconds one RDD is created
        machineDataRDD.print();

        JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(
                s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s));
        JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();

        groupedData.foreachRDD(new VoidFunction<JavaPairRDD<String, Iterable<String>>>() {
            @Override
            public void call(JavaPairRDD<String, Iterable<String>> data) throws Exception {
                data.foreachPartition(new VoidFunction<Iterator<Tuple2<String, Iterable<String>>>>() {
                    @Override
                    public void call(Iterator<Tuple2<String, Iterable<String>>> data) throws Exception {
                        while (data.hasNext()) {
                            LOGGER.error("Machine Data == >>" + data.next());
                        }
                    }
                });
            }
        });

        jssc.start();
        jssc.awaitTermination();
    }
    catch (Exception e) {
        e.printStackTrace();
    }
}
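The getMap(s) helper referenced in the code is not shown in the question; a minimal sketch of what it presumably does (assuming Jackson is on the classpath) would be:

import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;

// Assumed implementation of the getMap helper used above: parses one JSON
// message into a Map so that fields such as "DeviceId" can be looked up by name.
// The nested "data" object comes back as a Map as well.
private static final ObjectMapper MAPPER = new ObjectMapper();

@SuppressWarnings("unchecked")
public static Map<String, Object> getMap(String json) {
    try {
        return MAPPER.readValue(json, Map.class);
    } catch (Exception e) {
        throw new RuntimeException("Failed to parse message: " + json, e);
    }
}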
The grouping code below gives us an Iterable of Strings per device; ideally we would like to get hold of a Dataset instead.
JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s));
JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();
What matters most to me is the loop using foreachPartition, so that the execution of the code gets pushed down to the worker nodes.
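For illustration only, here is a minimal sketch of how a simple per-device aggregate could be computed in plain Java inside foreachPartition, with no SparkSession or SQLContext needed on the executors (the averaged speed field is just an example metric, and getMap is the same helper as above):

groupedData.foreachRDD(rdd -> rdd.foreachPartition(partition -> {
    // Runs on the executors: plain-Java aggregation per device.
    while (partition.hasNext()) {
        Tuple2<String, Iterable<String>> device = partition.next();
        long count = 0;
        double speedSum = 0d;
        for (String json : device._2()) {
            Map<String, Object> data = (Map<String, Object>) getMap(json).get("data");
            speedSum += ((Number) data.get("speed")).doubleValue();
            count++;
        }
        if (count > 0) {
            LOGGER.error("Device " + device._1() + " avg speed = " + (speedSum / count));
        }
    }
}));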
Answer 0 (score: 0)
After looking at more code samples and guides, it turns out that SQLContext and SparkSession are not serialized and made available on the worker nodes, so we are changing our strategy and will not try to build Datasets inside a foreachPartition loop.
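A minimal sketch of that alternative direction (our assumption, not code taken from the answer): build a Dataset from each windowed RDD on the driver via the SparkSession (this needs Spark 2.2+ for DataFrameReader.json(Dataset&lt;String&gt;)), and let the groupBy/agg itself run as a distributed job on the worker nodes:

// Needs org.apache.spark.sql.Dataset, Row, Encoders and functions.
machineDataRDD.foreachRDD(rdd -> {
    if (!rdd.isEmpty()) {
        // The Dataset is created on the driver, but the groupBy/agg below
        // still executes as a distributed job on the worker nodes.
        Dataset<Row> readings = spksess.read().json(
                spksess.createDataset(rdd.rdd(), Encoders.STRING()));
        Dataset<Row> perDevice = readings.groupBy("DeviceId")
                .agg(functions.avg("data.speed").alias("avgSpeed"),
                     functions.count("DeviceId").alias("samples"));
        perDevice.show();
    }
});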