Comparing the values of different keys in Scala / Spark

Date: 2017-12-08 22:21:29

Tags: scala apache-spark

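I am trying to find the differences between the values of related (but not identical) keys. For example, suppose I have the following map: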

("John_1", ["a","b","c"])
("John_2", ["a","b"])
("John_3", ["b","c"])
("Mary_5", ["a","d"])
("John_5", ["c","d","e"])

I want to compare the contents of Name_# with those of Name_(# - 1) and get the difference, e.g. John_2 compared with John_1.

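I was thinking of doing some kind of aggregateByKey and only then computing the difference between the lists, but I don't know how to match the keys I care about, i.e. Name_# with Name_(# - 1).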

2 answers:

Answer 0: (score: 0)

Split out the "id":

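import org.apache.spark.sql.functions._
// Outside spark-shell, toDF and $"..." also need the session implicits
// (assumes a SparkSession named `spark`):
// import spark.implicits._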

val df = Seq(
  ("John_1", Seq("a","b","c")), ("John_2", Seq("a","b")),
  ("John_3", Seq("b","c")), ("Mary_5", Seq("a","d")),
  ("John_5", Seq("c","d","e"))
).toDF("key", "values").withColumn(
  "user", split($"key", "_")(0)
).withColumn("id", split($"key", "_")(1).cast("long"))

Add a window:

val w = org.apache.spark.sql.expressions.Window
  .partitionBy($"user").orderBy($"id")

Define a udf:

val diff = udf((x: Seq[String], y: Seq[String]) => y.diff(x))

and compute:

 df
   .withColumn("is_previous", coalesce($"id" - lag($"id", 1).over(w) === 1, lit(false)))
   .withColumn("diff", when($"is_previous", diff( lag($"values", 1).over(w), $"values")).otherwise($"values"))
   .show

// +------+---------+----+---+-----------+---------+                               
// |   key|   values|user| id|is_previous|     diff|
// +------+---------+----+---+-----------+---------+
// |Mary_5|   [a, d]|Mary|  5|      false|   [a, d]|
// |John_1|[a, b, c]|John|  1|      false|[a, b, c]|
// |John_2|   [a, b]|John|  2|       true|       []|
// |John_3|   [b, c]|John|  3|       true|      [c]|
// |John_5|[c, d, e]|John|  5|      false|[c, d, e]|
// +------+---------+----+---+-----------+---------+
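To make the window logic easier to follow, here is a rough sketch of the same computation on plain Scala collections, with no Spark involved. The names `entries`, `parseKey` and `diffs` are made up for this illustration and are not part of the answer. Note that `Seq.diff` is a multiset difference: it removes one occurrence per matching element and keeps the original order.

```scala
// Plain Scala collections stand in for the DataFrame; no Spark required.
val entries: Seq[(String, Seq[String])] = Seq(
  "John_1" -> Seq("a", "b", "c"),
  "John_2" -> Seq("a", "b"),
  "John_3" -> Seq("b", "c"),
  "Mary_5" -> Seq("a", "d"),
  "John_5" -> Seq("c", "d", "e")
)

// Split "Name_#" into (name, number), mirroring the `user` and `id` columns.
def parseKey(key: String): (String, Long) = {
  val Array(user, id) = key.split("_")
  (user, id.toLong)
}

// For each user, sort by id and compare every entry with the one before it.
// An entry is diffed only when the previous id is exactly id - 1.
val diffs: Map[String, Seq[String]] =
  entries
    .map { case (key, values) => (parseKey(key), key, values) }
    .groupBy { case ((user, _), _, _) => user }
    .values
    .flatMap { rows =>
      val sorted = rows.sortBy { case ((_, id), _, _) => id }
      // Pair each row with its predecessor (None for the first row).
      val previous = None +: sorted.map(Some(_))
      sorted.zip(previous).map {
        case (((_, id), key, values), Some(((_, prevId), _, prevValues)))
            if id - prevId == 1 =>
          key -> values.diff(prevValues)
        case ((_, key, values), _) =>
          key -> values
      }
    }
    .toMap
```

With the sample data this yields the same `diff` column as the table above; John_5 keeps all of its values because John_4 does not exist.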

Answer 1: (score: 0)

I managed to solve my problem as follows. First, create a function that computes the previous key from the current key:

def getPrevKey(k: String): String = {
  // Split "Name_#" into the name and the numeric suffix
  val Array(n, h) = k.split("_")
  val i = h.toInt

  // Rebuild the key with the suffix decremented by one
  s"${n}_${i - 1}"
}
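As a side note, a more defensive variant might look like the sketch below. It is not the code from this answer: `prevKeyOf` is a hypothetical name, and splitting at the last underscore and returning an `Option` are assumptions about what malformed keys could look like.

```scala
// Hypothetical helper (not from the answer): returns the key one step before `key`.
// It splits at the LAST underscore, so names such as "Mary_Ann_5" still work,
// and returns None when there is no underscore or the suffix is not a number.
def prevKeyOf(key: String): Option[String] = {
  val idx = key.lastIndexOf('_')
  if (idx < 0) None
  else {
    val name = key.substring(0, idx)
    val suffix = key.substring(idx + 1)
    scala.util.Try(suffix.toInt).toOption.map(n => s"${name}_${n - 1}")
  }
}

val simple   = prevKeyOf("John_2")     // Some("John_1")
val compound = prevKeyOf("Mary_Ann_5") // Some("Mary_Ann_4")
val missing  = prevKeyOf("John")       // None
```

Returning an `Option` lets the caller decide what to do with keys that do not follow the Name_# pattern, instead of failing inside a Spark task.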

Then, create a copy of my RDD with shifted keys:

val copyRdd = myRdd.map(row => {
  val k1 = row._1
  val v1 = row._2

  // Re-key the row under the previous key
  val k2 = getPrevKey(k1)
  (k2,v1)
})

Finally, I union the two RDDs and reduce by key, taking the difference between the lists:

val result = myRdd.union(copyRdd)
  .reduceByKey(_.diff(_))

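This gives me exactly the result I need, but the union requires a lot of memory. The final result is not that large, but the intermediate results really weigh down the whole process.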
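One thing to watch with this approach: `reduceByKey` is documented as expecting an associative and commutative function, and `Seq.diff` is neither, so the direction of the difference depends on the order in which the two lists happen to be merged. The plain-Scala sketch below shows the asymmetry and one way to make the direction explicit by looking up the previous key directly. `data`, `previousKey` and `result` are hypothetical names for this illustration; on an RDD the same shape would probably be a `leftOuterJoin` against a re-keyed copy, which is not shown or measured here.

```scala
// Seq.diff is not commutative: the result depends on which side is "current".
val current  = Seq("b", "c") // e.g. the values of John_3
val previous = Seq("a", "b") // e.g. the values of John_2

val newInCurrent = current.diff(previous) // Seq("c")
val goneFromPrev = previous.diff(current) // Seq("a")

// Hypothetical in-memory stand-in for the RDD contents.
val data: Map[String, Seq[String]] = Map(
  "John_1" -> Seq("a", "b", "c"),
  "John_2" -> Seq("a", "b"),
  "John_3" -> Seq("b", "c"),
  "Mary_5" -> Seq("a", "d"),
  "John_5" -> Seq("c", "d", "e")
)

// Same idea as getPrevKey: "Name_#" becomes "Name_(# - 1)".
def previousKey(key: String): String = {
  val Array(name, number) = key.split("_")
  s"${name}_${number.toInt - 1}"
}

// Always compute "current minus previous"; keep the values unchanged
// when the previous key does not exist.
val result: Map[String, Seq[String]] = data.map { case (key, values) =>
  key -> data.get(previousKey(key)).fold(values)(prev => values.diff(prev))
}
```

Because the lookup always subtracts the previous list from the current one, the outcome no longer depends on merge order, and no extra keys such as John_0 appear in the result.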