Why does a pyspark SQL query against S3 return nulls?

Date: 2019-01-19 01:14:33

Tags: amazon-s3 pyspark null amazon-emr amazon-athena

I get different results when I run the same query against the same S3 source in Athena versus from a pyspark script on an EMR cluster (1 master x 10 workers). Athena returns data, but from the script all I get is nulls. Any suggestions/ideas/guesses as to why?

Here is the Athena query:

    SELECT <real_col1> as reg_plate, <real_col2> as model_num
    FROM <my Athena table name>
    WHERE partition_datetime LIKE '2019-01-01-14'
    limit 10

Which returns this result:

    reg_plate   model_num
       515355  961-824
       515355  961-824
       515355  961-824
       515355  961-824
       341243  047-891
       727027  860-403
       619656  948-977
       576345  951-657
       576345  951-657
       113721  034-035

But when I run this query as a script against the same S3 source, using the following:

    # Define SQL query
    load_qry = """SELECT <real_col1> as reg_plate, <real_col2> as model_num
    FROM s3_table
    WHERE partition_datetime LIKE '2019-01-01-14'
    limit 10"""

    df1 = spark.read.parquet("<s3:path to my data>")
    df1.createOrReplaceTempView("s3_table")

    sqlDF = spark.sql(load_qry)
    sqlDF.show(10)

I get nothing but nulls, like this:

    +---------+---------+
    |reg_plate|model_num|
    +---------+---------+
    |     null|     null|
    |     null|     null|
    |     null|     null|
    |     null|     null|
    |     null|     null|
    |     null|     null|
    |     null|     null|
    |     null|     null|
    |     null|     null|
    |     null|     null|
    +---------+---------+
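A useful sanity check at this point is to compare the schema Spark infers from the raw parquet files with the table definition in Athena/Glue; a mismatch between the two is a common source of all-null columns. A minimal sketch, using the `df1` defined above:

    # Print the schema Spark inferred from the parquet files; if the field
    # names or types differ from the Glue/Athena table definition, selected
    # columns can come back empty or null.
    df1.printSchema()

    # Look at a few raw rows to confirm the files at this path contain data.
    df1.show(5, truncate=False)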

Here is the config on my cluster, which is 1 master r3.xlarge and 10 r3.xlarge workers: Cluster config

Here is the command string I use to launch the spark job:

1 Answer:

Answer 0 (score: 0)

I found a simple solution.

Instead of

    load_qry = """SELECT <real_col1> as reg_plate, <real_col2> as model_num
    FROM s3_table WHERE partition_datetime LIKE '2019-01-01-14' limit 10"""
    df1 = spark.read.parquet("<s3:path to my data>")
    df1.createOrReplaceTempView("s3_table")

I used

    load_qry = """SELECT <real_col1> as reg_plate, <real_col2> as model_num
    FROM <my_athena_db>.table WHERE partition_datetime LIKE '2019-01-01-14'
    limit 10"""
    df1 = spark.sql(load_qry)

This works because Glue knows how to resolve "my_athena_db.table": Spark reads the table's schema and partition metadata from the Glue Data Catalog instead of inferring it from the raw parquet files.
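For completeness, here is the fix as a self-contained sketch. It assumes the EMR cluster was created with the AWS Glue Data Catalog enabled as Spark's metastore (an option at cluster creation), and it keeps the post's placeholder names `<my_athena_db>`, `<real_col1>`, and `<real_col2>`:

    # Minimal sketch, assuming EMR is configured to use the AWS Glue Data
    # Catalog as its Hive metastore; placeholder names are from the post.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("athena-table-query")
             .enableHiveSupport()  # resolve tables through the metastore (Glue on EMR)
             .getOrCreate())

    load_qry = """SELECT <real_col1> as reg_plate, <real_col2> as model_num
    FROM <my_athena_db>.table
    WHERE partition_datetime LIKE '2019-01-01-14'
    limit 10"""

    df1 = spark.sql(load_qry)
    df1.show(10)

Because the query goes through the catalog, `spark.sql` gets the correct column names, types, and partition layout up front rather than relying on whatever schema inference produces for the files at a hand-built S3 path.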