PySpark SQL != NULL & NOT IN for a complex SQL query

Date: 2017-12-23 19:08:24

Tags: pyspark apache-spark-sql spark-dataframe pyspark-sql

I have a Spark dataframe built like this:

from pyspark import SparkConf
from pyspark.sql import SQLContext
from pyspark_cassandra import CassandraSparkContext

conf = SparkConf()  # plus the Cassandra connection settings for the cluster
sc = CassandraSparkContext(conf=conf)
sql = SQLContext(sc)
log = sc.cassandraTable("test","log_a")\
            .select("m_date","userid","fsa","fsid").toDF()
sql.registerDataFrameAsTable(log, "log")

I can easily query a range on m_date:

query_str = ("select * from log where m_date >= %s and m_date < %s" %(1497052766,1498059766))
temp=sql.sql(query_str)
temp.show()

Everything works fine with this simple query. But I have a problem with this more complex one:

query_str = "select * from log "\
                "where userid != NULL "\
                "or fsa not in ("\
                "select fsa from log where userid is not null)"
query_str = query_str+ ("and m_date > %s and m_date < %s" %(1497052766,1498059766))
temp=sql.sql(query_str)

I ran into this error:

Py4JJavaError                             Traceback (most recent call last)
C:\opt\spark\spark-2.2.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\utils.py in deco(*a, **kw)
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:

C:\opt\spark\spark-2.2.1-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:

Py4JJavaError: An error occurred while calling o25.sql.
: org.apache.spark.sql.AnalysisException: Null-aware predicate sub-queries cannot be used in nested conditions: (NOT (userid#1 = null) || ((NOT fsa#2 IN (list#62 []) && (m_date#0L > cast(1497052766 as bigint))) && (m_date#0L < cast(1498059766 as bigint))));;
Project [m_date#0L, userid#1, fsa#2, fsid#3]
+- Filter (NOT (userid#1 = null) || ((NOT fsa#2 IN (list#62 []) && (m_date#0L > cast(1497052766 as bigint))) && (m_date#0L < cast(1498059766 as bigint))))
   :  +- Project [fsa#2]
   :     +- Filter isnotnull(userid#1)
   :        +- SubqueryAlias log
   :           +- LogicalRDD [m_date#0L, userid#1, fsa#2, fsid#3]
   +- SubqueryAlias log
      +- LogicalRDD [m_date#0L, userid#1, fsa#2, fsid#3]

        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:207)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78)
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
        at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52)
        at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
        at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Unknown Source)


During handling of the above exception, another exception occurred:

AnalysisException                         Traceback (most recent call last)
E:\FPT\project-spark-streaming\spark-calculate-newuser-daily.py in <module>()
     76                 "select fsa from log where userid is not null)"
     77         query_str=query_str+ ("and m_date > %s and m_date < %s" %(1497052766,1498059766))
---> 78         temp=sql.sql(query_str)
     79         pass

C:\opt\spark\spark-2.2.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\context.py in sql(self, sqlQuery)
    382         [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]
    383         """
--> 384         return self.sparkSession.sql(sqlQuery)
    385
    386     @since(1.0)

C:\opt\spark\spark-2.2.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\session.py in sql(self, sqlQuery)
    601         [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]
    602         """
--> 603         return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
    604
    605     @since(2.0)

C:\opt\spark\spark-2.2.1-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134
   1135         for temp_arg in temp_args:

C:\opt\spark\spark-2.2.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\utils.py in deco(*a, **kw)
     67                                              e.java_exception.getStackTrace()))
     68             if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
     70             if s.startswith('org.apache.spark.sql.catalyst.analysis'):
     71                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)

AnalysisException: 'Null-aware predicate sub-queries cannot be used in nested conditions: (NOT (userid#1 = null) || ((NOT fsa#2 IN (list#62 []) && (m_date#0L > cast(1497052766 as bigint))) && (m_date#0L < cast(1498059766 as bigint))));;\nProject [m_date#0L, userid#1, fsa#2, fsid#3]\n+- Filter (NOT (userid#1 = null) || ((NOT fsa#2 IN (list#62 []) && (m_date#0L > cast(1497052766 as bigint))) && (m_date#0L < cast(1498059766 as bigint))))\n   :  +- Project [fsa#2]\n   :     +- Filter isnotnull(userid#1)\n   :
        +- SubqueryAlias log\n   :           +- LogicalRDD [m_date#0L, userid#1, fsa#2, fsid#3]\n   +- SubqueryAlias log\n      +- LogicalRDD [m_date#0L, userid#1, fsa#2, fsid#3]\n'
17/12/24 20:53:17 WARN SparkEnv: Exception while deleting Spark temp dir: C:\Users\hptphuong\AppData\Local\Temp\spark-c9fd644d-de1a-47c9-9e19-cbd0b01df138\userFiles-412a0e89-c56f-4897-98e7-05cd6114855f
java.io.IOException: Failed to delete: C:\Users\hptphuong\AppData\Local\Temp\spark-c9fd644d-de1a-47c9-9e19-cbd0b01df138\userFiles-412a0e89-c56f-4897-98e7-05cd6114855f
        at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1031)
        at org.apache.spark.SparkEnv.stop(SparkEnv.scala:103)
        at org.apache.spark.SparkContext$$anonfun$stop$11.apply$mcV$sp(SparkContext.scala:1944)
        at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1317)
        at org.apache.spark.SparkContext.stop(SparkContext.scala:1943)
        at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:581)
        at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1948)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
        at scala.util.Try$.apply(Try.scala:192)
        at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
        at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
        at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
17/12/24 20:53:17 ERROR ShutdownHookManager: Exception while deleting Spark temp dir: C:\Users\hptphuong\AppData\Local\Temp\spark-c9fd644d-de1a-47c9-9e19-cbd0b01df138\userFiles-412a0e89-c56f-4897-98e7-05cd6114855f

I tried running the two conditions joined by "or" as separate queries, and each works fine on its own. But when I combine them, this error occurs. I also tried replacing the "or" with a union of two queries; it works, but it looks ridiculous.

Please show me how to fix it.

Thanks a lot.

@AKSW: sorry for the lack of information about the problem. I have updated my question. Please help me.

2 Answers:

Answer 0 (score: 1)

One obvious mistake I see in your code is the != NULL comparison. To check whether something is or is not null, you should use IS NULL and IS NOT NULL respectively.

Another issue I see is that you are not grouping your conditions with parentheses, but I will assume you know what you intend with your logic.

I suggest rewriting the query as follows and checking whether it works for you:

query_str = '''
SELECT *
FROM log
WHERE (m_date > {0} AND m_date < {1})
AND (userid IS NOT NULL
    OR fsa NOT IN (
        SELECT fsa FROM log WHERE userid IS NOT NULL
    )
)'''.format(1497052766, 1498059766)

temp=sql.sql(query_str)

However, I should add a note (as mentioned in the comments above) that SQL support in Spark is incomplete, and whether this works may depend on your Spark version and on whether the columns used in the query are nullable. If it does not, you will have to write separate queries and combine them according to your logic.
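As a rough sketch of that fallback (assuming the same log table registered above and the sql SQLContext from the question; names such as date_filter, with_user and anonymous are only illustrative), the two halves of the OR can be run as separate queries and combined with a union:

date_filter = "m_date > %s AND m_date < %s" % (1497052766, 1498059766)

# Rows that have a userid.
with_user = sql.sql(
    "SELECT * FROM log WHERE userid IS NOT NULL AND " + date_filter)

# Rows whose fsa never appears together with a non-null userid.
# Here NOT IN is a top-level conjunct of the filter, not nested inside an OR.
anonymous = sql.sql(
    "SELECT * FROM log WHERE fsa NOT IN "
    "(SELECT fsa FROM log WHERE userid IS NOT NULL) AND " + date_filter)

# Combine the two halves; distinct() drops rows selected by both branches
# (it also collapses duplicate source rows, if any exist).
temp = with_user.union(anonymous).distinct()

It is more verbose than a single statement, but each query keeps the NOT IN subquery at the top level of its filter, which is what the analyzer requires.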

Answer 1 (score: 0)

NOT IN (subquery) has some limitations in Spark 2.0 (see THIS). You can still use EXISTS / NOT EXISTS instead.
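As a sketch of that idea (assuming the same log table and time range from the question; the aliases l and g are only illustrative, and note that NOT EXISTS treats NULL values of fsa differently from NOT IN), the query could be rewritten like this; whether it passes the analyzer still depends on your Spark version:

query_str = '''
SELECT *
FROM log l
WHERE (l.m_date > {0} AND l.m_date < {1})
AND (l.userid IS NOT NULL
    OR NOT EXISTS (
        SELECT 1 FROM log g
        WHERE g.userid IS NOT NULL AND g.fsa = l.fsa
    )
)'''.format(1497052766, 1498059766)

temp = sql.sql(query_str)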

PS: please specify your Spark version to help others who run into the same problem.