PySpark SQL logical conjunction

Date: 2019-02-13 09:29:19

Tags: sql python-3.x pyspark pyspark-sql vertica

  

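I am running Spark 1.6 on 3 VMs (1 master node and 2 slave nodes) and retrieving data from my Vertica database in order to process it. The query below runs fine on the Vertica database, but not in PySpark. Spark DataFrames support predicate pushdown with JDBC sources, but the term "predicate" is used in its strict SQL sense: it covers only the WHERE clause. Moreover, it appears to be limited to logical conjunctions (no IN and no OR, I'm afraid) and simple predicates. Is there any way to execute the following query as is?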

    url = "*******"
    properties = {
        "user": "*****",
        "password": "*******",
        "driver": "com.vertica.jdbc.Driver"
    }

    # A raw triple-quoted string keeps the single-quoted SQL literals from ending the Python string.
    # Note: $CONDITIONS is a Sqoop placeholder; Spark does not substitute it.
    query = r"""(select min(date(time_stamp)) mindate, max(date(time_stamp)) maxdate,
        count(distinct date(time_stamp)) noofdays, subscriber, server_hostname,
        sum(bytes_in) DL, sum(bytes_out) UL, sum(connections_out) conn
      from traffic.stats
      where \$CONDITIONS and SUBSCRIBER like '41601%'
        and date(time_stamp) between '2019-01-25' and '2019-01-29'
        and signature_service_category = 'Web Browsing'
        and (signature_service_name = 'SSL v3' or signature_service_name = 'HTTP2 over TLS')
        and server_hostname not like '%.googleapis.%' and server_hostname not like '%.google.%'
        and server_hostname <> 'doubleclick.net' and server_hostname <> 'youtube.com'
        and server_hostname <> 'googleadservices.com' and server_hostname <> 'app-measurement.com'
        and server_hostname <> 'gstatic.com' and server_hostname <> 'googlesyndication.com'
        and server_hostname <> 'google-analytics.com' and server_hostname <> 'googleusercontent.com'
        and server_hostname <> 'ggpht.com' and server_hostname <> 'googletagmanager.com'
        and server_hostname is not null
      group by subscriber, server_hostname) temp"""

    df = (sqlContext.read.format("jdbc")
          .options(url=url, dbtable=query, **properties)
          .load())
    df.show()
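
The query above reaches Spark as a parenthesized subquery with an alias in the `dbtable` option. The following is a minimal sketch of building such a string without quoting problems. The helper name `as_jdbc_subquery` and the shortened query are hypothetical and only illustrate the pattern; they are not part of the original question.

```python
def as_jdbc_subquery(sql, alias="temp"):
    """Wrap a SELECT statement as a derived table for the JDBC `dbtable` option."""
    # Drop surrounding whitespace and a trailing semicolon,
    # which is not valid inside the parentheses.
    body = sql.strip().rstrip(";").strip()
    return "({}) {}".format(body, alias)


# Triple quotes let the SQL keep its own single-quoted literals.
# Shortened, illustrative version of the query from the question.
query = """
SELECT subscriber, server_hostname, sum(bytes_in) DL
FROM traffic.stats
WHERE subscriber LIKE '41601%'
  AND (signature_service_name = 'SSL v3' OR signature_service_name = 'HTTP2 over TLS')
GROUP BY subscriber, server_hostname
"""

dbtable = as_jdbc_subquery(query)
```

The resulting string could then be passed as `dbtable=dbtable` to `.options(...)`. Because the subquery text is sent to the database as part of the FROM clause, its WHERE clause, including the OR, should be evaluated by Vertica rather than depending on Spark's predicate pushdown.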

0 answers:

No answers