SQL on Spark: how do I get all the distinct values?

Date: 2016-03-20 17:45:21

Tags: sql apache-spark-sql

So, suppose I have the following table:

Name | Color
------------------------------
John | Blue
Greg | Red
John | Yellow
Greg | Red
Greg | Blue

I want to get a table that lists the distinct colors for each name: how many there are and what they are. Meaning something like this:

Name | Distinct | Values
--------------------------------------
John |   2      | Blue, Yellow
Greg |   2      | Red, Blue

Any ideas how to do this?

2 answers:

Answer 0 (score: 4)

collect_list will give you a list without removing duplicates. collect_set will automatically remove duplicates, so it's just:

SELECT
  Name,
  COUNT(DISTINCT Color) AS distinct_count, -- "Distinct" is a reserved word, so not a very good name
  collect_set(Color) AS color_values       -- "Values" is reserved too
FROM TblName
GROUP BY Name

This function has been available since Spark 1.6.0; check it out:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala

/**
   * Aggregate function: returns a set of objects with duplicate elements eliminated.
   *
   * For now this is an alias for the collect_set Hive UDAF.
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def collect_set(columnName: String): Column = collect_set(Column(columnName))
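
For reference, the same aggregation can be written with the DataFrame API instead of raw SQL. This is a minimal sketch, not part of the original answer; it assumes a DataFrame df with the Name and Color columns from the question, and uses countDistinct and collect_set from pyspark.sql.functions:

    from pyspark.sql import functions as F

    # assumes `df` is a DataFrame with Name and Color columns
    result = df.groupBy("Name").agg(
        F.countDistinct("Color").alias("distinct_count"),  # how many distinct colors
        F.collect_set("Color").alias("color_values")       # which colors they are
    )
    result.show()

Swapping collect_set for collect_list would keep the duplicates, e.g. Greg would get [Red, Red, Blue] instead of [Red, Blue].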

Answer 1 (score: 0)

For PySpark: I come from an R/pandas background, so I actually find Spark DataFrames easier to work with.

To do this:

  1. Set up a Spark SQL context
  2. Read your file into a DataFrame
  3. Register your DataFrame as a temp table
  4. Query it directly with SQL syntax
  5. Save the results as an object, output to a file... do your thing

Here is a class I created to do this:

    import os

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext


    class SQLspark():

        def __init__(self, local_dir='./', hdfs_dir='/users/', master='local', appname='spark_app', spark_mem=2):
            self.local_dir = local_dir
            self.hdfs_dir = hdfs_dir
            self.master = master
            self.appname = appname
            self.spark_mem = int(spark_mem)
            self.conf = (SparkConf()
                         .setMaster(self.master)
                         .setAppName(self.appname)
                         # executor memory needs a unit, e.g. '2g'
                         .set("spark.executor.memory", "{}g".format(self.spark_mem)))
            self.sc = SparkContext(conf=self.conf)
            self.sqlContext = SQLContext(self.sc)

        def file_to_df(self, input_file):
            # read the file as a DataFrame; inferSchema guesses the column
            # types (without it, all columns come in as strings)
            df = (self.sqlContext.read
                  .format("com.databricks.spark.csv")
                  .option("header", "true")
                  .option("delimiter", "\t")
                  .option("inferSchema", "true")
                  .load(input_file))
            # cache the df object to avoid rebuilding it each time
            df.cache()
            # register as a temp table for querying; use 'spark_df' as the table name
            df.registerTempTable("spark_df")
            return df

        # you can also cast a Spark DataFrame to a pandas DataFrame
        def sparkDf_to_pandasDf(self, input_df):
            pandas_df = input_df.toPandas()
            return pandas_df

        def find_distinct(self, col_name):
            my_query = self.sqlContext.sql("SELECT DISTINCT {} FROM spark_df".format(col_name))
            # now do your thing with the results, etc.
            my_query.show()
            return my_query


    if __name__ == '__main__':

        # instantiate the class
        # see __init__ for the variables to pass in
        spark = SQLspark(os.getcwd(), 'hdfs_loc', "local", "etl_test", 10)

        # specify the input file to process, then load and register it
        tsv_infile = 'path/to/file'
        df = spark.file_to_df(tsv_infile)
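
Once the file is loaded and registered as the temp table spark_df, the original question can be answered with the same SQL as in the first answer. A hypothetical usage sketch, assuming the TSV has the Name and Color columns (note that on Spark 1.x, collect_set is a Hive UDAF, so it may require a HiveContext rather than a plain SQLContext):

    # hypothetical follow-up, assuming Name and Color columns exist;
    # on Spark 1.x, collect_set may need a HiveContext instead of SQLContext
    result = spark.sqlContext.sql("""
        SELECT Name,
               COUNT(DISTINCT Color) AS distinct_count,
               collect_set(Color) AS color_values
        FROM spark_df
        GROUP BY Name
    """)
    result.show()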