Question

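Ultimately, what I want is the mode of each column, for every column in the DataFrame. For other summary statistics, I see a couple of options: use DataFrame aggregation, or map the columns of the DataFrame to an RDD of vectors (something I am also having trouble with) and use colStats from MLlib. But I don't see the mode as an option there.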

Answer 1

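The problem with the mode is pretty much the same as with the median. While it is easy to define, the computation is rather expensive. It can be done either with a sort followed by local and global aggregations, or with a just-another-wordcount and a filter: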

import numpy as np
from pyspark.sql.functions import col, max  # note: shadows the built-in max
np.random.seed(1)

df = sc.parallelize([
    (int(x), ) for x in np.random.randint(50, size=10000)
]).toDF(["x"])

cnts = df.groupBy("x").count()
mode = cnts.join(
    cnts.agg(max("count").alias("max_")), col("count") == col("max_")
).limit(1).select("x")
mode.first()[0]
## 0

Either way, it may require a full shuffle for each column.
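The "local and global aggregation" route mentioned in this answer is not shown in its code, so here is a minimal plain-Python sketch of the idea. It is not Spark code: the nested lists stand in for partitions, and the function name `mode_local_global` and the sample data are made up for illustration.

```python
from collections import Counter

def mode_local_global(partitions):
    """Return (value, count) for the most frequent value across all partitions."""
    # Local aggregation: count values inside each partition independently.
    local_counts = [Counter(part) for part in partitions]

    # Global aggregation: merge the per-partition counts into one table.
    global_counts = Counter()
    for counts in local_counts:
        global_counts.update(counts)

    # Pick the entry with the largest count; ties go to the smallest value
    # so the result is deterministic.
    return min(global_counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical data split across three "partitions".
partitions = [[1, 2, 2, 3], [2, 3, 3], [3, 3, 1]]
print(mode_local_global(partitions))  # (3, 5)
```

The merge step is where the cost comes from in a distributed setting: counts for the same value must be brought together from every partition before the maximum can be chosen.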

Answer 2

You can calculate the column mode using Java code as follows:

            case MODE:
                Dataset<Row> cnts = ds.groupBy(column).count();
                Dataset<Row> dsMode = cnts.join(
                        cnts.agg(functions.max("count").alias("max_")),
                        functions.col("count").equalTo(functions.col("max_")
                        ));
                Dataset<Row> mode = dsMode.limit(1).select(column);
                replaceValue = ((GenericRowWithSchema) mode.first()).values()[0];
                ds = replaceWithValue(ds, column, replaceValue);
                break;

private static Dataset<Row> replaceWithValue(Dataset<Row> ds, String column, Object replaceValue) {
    return ds.withColumn(column,
            functions.coalesce(functions.col(column), functions.lit(replaceValue)));
}

Answer 3

>>> df=newdata.groupBy('columnName').count()
>>> mode = df.orderBy(df['count'].desc()).collect()[0][0]

See my result:

>>> newdata.groupBy('var210').count().show()
+------+-----+
|var210|count|
+------+-----+
|  3av_|   64|
|  7A3j|  509|
|  g5HH| 1489|
|  oT7d|  109|
|  DM_V|  149|
|  uKAI|44883|
+------+-----+

# store the above result in df
>>> df=newdata.groupBy('var210').count()
>>> df.orderBy(df['count'].desc()).collect()
[Row(var210='uKAI', count=44883),
Row(var210='g5HH', count=1489),
Row(var210='7A3j', count=509),
Row(var210='DM_V', count=149),
Row(var210='oT7d', count=109),
Row(var210='3av_', count=64)]

# get the first value using collect()
>>> mode = df.orderBy(df['count'].desc()).collect()[0][0]
>>> mode
'uKAI'

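Use the groupBy() function to get the count of each category in the column. df is my result DataFrame, with the two columns var210 and count. Calling orderBy() on the 'count' column in descending order puts the maximum value in the first row of the DataFrame. collect()[0][0] then takes the first field of the first row.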
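To make the indexing concrete, the same steps can be traced in plain Python on the counts printed above. This is only a sketch of the logic, not Spark code; the list of tuples stands in for the grouped DataFrame.

```python
# Counts copied from the groupBy('var210').count() output above.
counts = [("3av_", 64), ("7A3j", 509), ("g5HH", 1489),
          ("oT7d", 109), ("DM_V", 149), ("uKAI", 44883)]

# Equivalent of orderBy on count, descending: sort rows by count, largest first.
ordered = sorted(counts, key=lambda row: row[1], reverse=True)

# Equivalent of collect()[0][0]: first row, then its first field.
mode = ordered[0][0]
print(mode)  # uKAI
```

Note that collect() brings every grouped row back to the driver before indexing, whereas first(), as used in Answer 4, fetches only the top row.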

Answer 4

This line will give you the mode of "col" in the Spark DataFrame df:

df.groupby("col").count().orderBy("count", ascending=False).first()[0]

For a list of the modes of all columns in df, use:

[df.groupby(i).count().orderBy("count", ascending=False).first()[0] for i in df.columns]

To add names that identify which mode belongs to which column, make a 2D list:

[[i, df.groupby(i).count().orderBy("count", ascending=False).first()[0]] for i in df.columns]
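If a lookup by column name is more convenient than a 2D list, the same per-column loop can produce a dictionary. The sketch below shows that shape in plain Python rather than Spark; `column_modes`, the tuple rows, and the column names are hypothetical stand-ins for df and df.columns.

```python
from collections import Counter

def column_modes(rows, columns):
    """Return {column_name: most_frequent_value} for rows given as tuples."""
    modes = {}
    for idx, name in enumerate(columns):
        # Count the values that appear in this column position.
        counts = Counter(row[idx] for row in rows)
        # most_common(1) returns [(value, count)] for the top entry.
        modes[name] = counts.most_common(1)[0][0]
    return modes

rows = [("a", 1), ("b", 1), ("a", 2), ("a", 1)]
print(column_modes(rows, ["letter", "number"]))  # {'letter': 'a', 'number': 1}
```

As with the one-liners above, each column is counted separately, so the work grows with the number of columns.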

Answer 5

The following method can help you get the mode of every column of an input DataFrame:

from pyspark.sql.functions import monotonically_increasing_id

def get_mode(df):
    column_lst = df.columns
    res = [df.select(i).groupby(i).count().orderBy("count", ascending=False) for i in column_lst]
    df_mode = res[0].limit(1).select(column_lst[0]).withColumn("temp_name_monotonically_increasing_id", monotonically_increasing_id())
    
    for i in range(1, len(res)):
        df2 = res[i].limit(1).select(column_lst[i]).withColumn("temp_name_monotonically_increasing_id", monotonically_increasing_id())
        df_mode = df_mode.join(df2, (df_mode.temp_name_monotonically_increasing_id == df2.temp_name_monotonically_increasing_id)).drop(df2.temp_name_monotonically_increasing_id)
        
    return df_mode.drop("temp_name_monotonically_increasing_id")

Calculate the mode of a PySpark DataFrame column?
