Why does a pyspark SQL query against S3 return nulls?

Date: 2019-01-19 01:14:33

Tags: amazon-s3 pyspark null amazon-emr amazon-athena

I get different results when I run the same query against the same S3 source in Athena versus from a pyspark script on an EMR cluster (1 master x 10 workers). Athena returns data, but from the script all I get is nulls. Any suggestions/ideas/guesses as to why?

Here is the Athena query:

    SELECT <real_col1> as reg_plate, <real_col2> as model_num
    FROM <my Athena table name>
    WHERE partition_datetime LIKE '2019-01-01-14'
    limit 10

Which returns this result:

    reg_plate   model_num
       515355  961-824
       515355  961-824
       515355  961-824
       515355  961-824
       341243  047-891
       727027  860-403
       619656  948-977
       576345  951-657
       576345  951-657
       113721  034-035

But when I run this query as a script against the same S3 source, using the following:

    # Define SQL query
    load_qry = """SELECT <real_col1> as reg_plate, <real_col2> as model_num
    FROM s3_table
    WHERE partition_datetime LIKE '2019-01-01-14'
    limit 10"""

    df1 = spark.read.parquet("<s3:path to my data>")
    df1.createOrReplaceTempView("s3_table")

    sqlDF = spark.sql(load_qry)
    sqlDF.show(10)

I get nothing but nulls, like this:

    +---------+---------+
    |reg_plate|model_num|
    +---------+---------+
    |     null|     null|
    |     null|     null|
    |     null|     null|
    |     null|     null|
    |     null|     null|
    |     null|     null|
    |     null|     null|
    |     null|     null|
    |     null|     null|
    |     null|     null|
    +---------+---------+
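A useful sanity check at this point is to compare the schema Spark infers from the raw parquet files with the table definition in Athena/Glue; a mismatch between the two is a common source of all-null columns. A minimal sketch, using the `df1` defined above:

    # Print the schema Spark inferred from the parquet files; if the field
    # names or types differ from the Glue/Athena table definition, selected
    # columns can come back empty or null.
    df1.printSchema()

    # Look at a few raw rows to confirm the files at this path contain data.
    df1.show(5, truncate=False)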

Here is the config on my cluster, which is 1 master r3.xlarge and 10 r3.xlarge workers: Cluster config

Here is the command string I use to launch the spark job:

1 Answer:

Answer 0 (score: 0)

I found a simple solution.

Instead of

    load_qry = """SELECT <real_col1> as reg_plate, <real_col2> as model_num
    FROM s3_table WHERE partition_datetime LIKE '2019-01-01-14' limit 10"""
    df1 = spark.read.parquet("<s3:path to my data>")
    df1.createOrReplaceTempView("s3_table")

I used

    load_qry = """SELECT <real_col1> as reg_plate, <real_col2> as model_num
    FROM <my_athena_db>.table WHERE partition_datetime LIKE '2019-01-01-14'
    limit 10"""
    df1 = spark.sql(load_qry)

This works because Glue knows how to resolve "my_athena_db.table": Spark reads the table's schema and partition metadata from the Glue Data Catalog instead of inferring it from the raw parquet files.
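For completeness, here is the fix as a self-contained sketch. It assumes the EMR cluster was created with the AWS Glue Data Catalog enabled as Spark's metastore (an option at cluster creation), and it keeps the post's placeholder names `<my_athena_db>`, `<real_col1>`, and `<real_col2>`:

    # Minimal sketch, assuming EMR is configured to use the AWS Glue Data
    # Catalog as its Hive metastore; placeholder names are from the post.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("athena-table-query")
             .enableHiveSupport()  # resolve tables through the metastore (Glue on EMR)
             .getOrCreate())

    load_qry = """SELECT <real_col1> as reg_plate, <real_col2> as model_num
    FROM <my_athena_db>.table
    WHERE partition_datetime LIKE '2019-01-01-14'
    limit 10"""

    df1 = spark.sql(load_qry)
    df1.show(10)

Because the query goes through the catalog, `spark.sql` gets the correct column names, types, and partition layout up front rather than relying on whatever schema inference produces for the files at a hand-built S3 path.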