Struggling with deduplication after aggregation in Spark Streaming

时间:2018-11-15 15:09:55

标签: scala apache-spark duplicates spark-streaming spark-structured-streaming

1. The streaming data comes from Kafka. 2. It is consumed with Spark Streaming. 3. The columns are firstname, lastname, userid and membername (from membername I derive the member count; for example, in mark,tyson,2,chris,lisa,iwanka the member count is 3).

I do need to compute that count somehow. My concern is how to deduplicate the results after the aggregation.
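A minimal sketch of how that count could be derived with Spark Structured Streaming, assuming each Kafka record's value is a comma-separated line like mark,tyson,2,chris,lisa,iwanka (the broker address, topic name and parsing logic below are assumptions, not part of the original question):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("member-count").getOrCreate()
    import spark.implicits._

    // Read the raw stream from Kafka; broker and topic are placeholders.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "members")
      .load()
      .selectExpr("CAST(value AS STRING) AS line")

    // Assume each line is "firstname,lastname,userid,member1,member2,...",
    // e.g. "mark,tyson,2,chris,lisa,iwanka" -> membercount 3.
    val df = raw
      .select(split($"line", ",").as("cols"))
      .select(
        $"cols".getItem(0).as("firstname"),
        $"cols".getItem(1).as("lastname"),
        $"cols".getItem(2).cast("int").as("userid"),
        (size($"cols") - 3).as("membercount"))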


Batch-1 output

  val df2 = df.select("firstname", "lastname", "membercount", "userid")
  df2.writeStream.format("console").start().awaitTermination()

  or

  df3.select("*").where("membercount >= 3").dropDuplicates("userid")

  // this one is not working, but I need to do the same only after the
  // count, so that in later batches the same userid will not come again.
  // I want only the first-time entry.
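In Structured Streaming, dropDuplicates on a streaming Dataset keeps its deduplication state across micro-batches, so a userid that has been emitted once is not emitted again unless a watermark expires that state. A minimal sketch of the full query, assuming df3 carries a per-record membercount as in the example above rather than the result of a streaming groupBy (the sink and output mode are assumptions):

    // Streaming deduplication: Spark keeps every userid it has seen in
    // state, so later micro-batches will not emit the same userid again.
    // Without a watermark this state grows without bound; if the data has
    // an event-time column, add .withWatermark(...) before dropDuplicates
    // so that old state can be dropped.
    val deduped = df3
      .where("membercount >= 3")
      .dropDuplicates("userid")

    deduped.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()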

Batch-2 output

  firstname         lastname          member-count            userid
  john              smith             5                       1
  mark              boucher           8                       2
  shawn             pollock           3                       3

// but here is the batch-2 output that I actually want:

1. john smith's or shawn pollock's count may increase again in the next batch, but I do not want to show or keep them in the next batch's output.

i.e. based on userid I want an entry in the batch output only once, and the same user should be ignored if it shows up again in a later batch output:

  firstname         lastname          member-count            userid
  chris             jordan            6                       4
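If the "first entry only" rule needs to be controlled explicitly rather than through dropDuplicates' internal state, one option is foreachBatch with an externally tracked set of already-emitted userids. The sketch below keeps that set only in driver memory, so it illustrates the idea rather than being a fault-tolerant solution; df3 and the membercount >= 3 filter come from the question, everything else is assumed:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col
    import scala.collection.mutable

    // Driver-side record of the userids that have already been written out.
    // Not fault tolerant: a restart forgets it. A real job would persist the
    // ids externally, or simply rely on dropDuplicates' own state.
    val seenUserIds = mutable.Set[Any]()

    val emitFirstEntryOnly: (DataFrame, Long) => Unit = (batch, batchId) => {
      val candidates = batch.where("membercount >= 3")
      // Drop rows whose userid was already emitted in an earlier batch.
      val fresh =
        if (seenUserIds.isEmpty) candidates
        else candidates.filter(!col("userid").isin(seenUserIds.toSeq: _*))
      fresh.persist()
      seenUserIds ++= fresh.select("userid").collect().map(_.get(0))
      fresh.show(truncate = false) // stand-in for the real sink
      fresh.unpersist()
    }

    df3.writeStream
      .foreachBatch(emitFirstEntryOnly)
      .start()
      .awaitTermination()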

1 answer:

Answer 0: (score: 0)

Your question is hard to follow, but as far as I understand it, you want a conditional while loop?

var a = 10
while (a < 20) {
  println("Value of a: " + a)
  a = a + 1
}

which will print, for example,

Value of a: 10
Value of a: 11
Value of a: 12
Value of a: 13
Value of a: 14
Value of a: 15
Value of a: 16
Value of a: 17
Value of a: 18
Value of a: 19