I am trying to get the most frequently occurring value in each column, along with its count. I can do this with the pandas DataFrame describe function in Python, but when the data is very large it fails to run on the cluster. I would really appreciate your help.
Steps to reproduce:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import collect_set, collect_list

data = [("A", "abc"), ("B", "abc"), ("A", "cad"), ("A", "abc")]
profSchema = StructType([StructField("var1", StringType(), True),
                         StructField("var2", StringType(), True)])
fin_df = spark.createDataFrame(data, schema=profSchema)

collist = fin_df.columns
distinct = fin_df.select(*[collect_list(c).alias(c) for c in collist]).take(1)[0]
The output is a Row of lists:
Row(var1=[u'A', u'B', u'A', u'A'], var2=[u'abc', u'abc', u'cad', u'abc'])
I am looking for output of the form: colname, mostfrequentvalue, count