Calculating the variance across columns in pyspark

Time: 2017-04-28 15:25:57

Tags: pyspark multiple-columns variance

How can I calculate the variance across several columns, row by row, in pyspark? For example, if the pyspark.sql.dataframe table is:

ID  A   B   C
1   12  15  7
2   6   15  2
3   56  25  25
4   36  12  5

and the desired output is:

ID  A   B   C   Variance
1   12  15  7   10.9
2   6   15  2   29.6
3   56  25  25  213.6
4   36  12  5   176.2

pyspark has a variance function, but it only works column-wise.

1 answer:

Answer 0 (score: 2)

Use the concat_ws function to concatenate the required columns, then compute the variance of the concatenated values with a udf.
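The answer's original code did not survive on this page, so the following is only a minimal sketch of the approach it describes, not the author's code. The helper names `row_variance` and `add_variance` and the comma separator are assumptions made for illustration. The sketch uses the population variance (dividing by n), since that is what matches the values in the desired output (for example 10.9 for 12, 15, 7).

```python
import statistics


def row_variance(joined, sep=","):
    """Population variance of the numbers in a separator-joined string.

    Hypothetical helper: returns None when the string holds no values,
    which happens when every input column in the row is null.
    """
    values = [float(v) for v in joined.split(sep) if v != ""]
    return float(statistics.pvariance(values)) if values else None


def add_variance(df, cols, sep=","):
    """Return df with a row-wise 'Variance' column computed over cols."""
    # Imported here so row_variance stays usable without a Spark install.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    variance_udf = F.udf(lambda s: row_variance(s, sep), DoubleType())
    # concat_ws joins the values into one string per row and skips nulls.
    joined = F.concat_ws(sep, *[F.col(c).cast("string") for c in cols])
    return df.withColumn("Variance", variance_udf(joined))
```

Assuming `df` is the dataframe from the question, `add_variance(df, ["A", "B", "C"])` would append the Variance column. The values are not rounded here; wrapping the column in `pyspark.sql.functions.round` with one decimal place should give the format shown in the question.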

