Question

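I am running Spark 1.6 on 3 VMs (1 master, 2 slaves) and retrieving data from my Vertica database in order to process it. The query below runs fine directly on the Vertica DB, but I cannot get it to run as-is from pyspark.

As far as I understand, Spark DataFrames support predicate pushdown with JDBC sources, but the term "predicate" is used in its strict SQL meaning: it covers only the WHERE clause. It also appears to be limited to logical conjunction (no IN and no OR, I am afraid) and to simple predicates. Is there any way to execute the following query as-is?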

    url = "*******"
    properties = {
        "user": "*****",
        "password": "*******",
        "driver": "com.vertica.jdbc.Driver"
    }

    # The whole statement is passed as a parenthesized, aliased subquery.
    # A triple-quoted string is used because the SQL contains single quotes.
    # \$CONDITIONS is kept exactly as in the original query; it is a
    # Sqoop-style placeholder and Spark sends it to the database unchanged.
    query = """(SELECT min(date(time_stamp)) mindate, max(date(time_stamp)) maxdate,
            count(distinct date(time_stamp)) noofdays, subscriber, server_hostname,
            sum(bytes_in) DL, sum(bytes_out) UL, sum(connections_out) conn
        from traffic.stats
        where \$CONDITIONS
          and SUBSCRIBER like '41601%'
          and date(time_stamp) between '2019-01-25' and '2019-01-29'
          and signature_service_category = 'Web Browsing'
          and (signature_service_name = 'SSL v3' or signature_service_name = 'HTTP2 over TLS')
          and server_hostname not like '%.googleapis.%'
          and server_hostname not like '%.google.%'
          and server_hostname <> 'doubleclick.net'
          and server_hostname <> 'youtube.com'
          and server_hostname <> 'googleadservices.com'
          and server_hostname <> 'app-measurement.com'
          and server_hostname <> 'gstatic.com'
          and server_hostname <> 'googlesyndication.com'
          and server_hostname <> 'google-analytics.com'
          and server_hostname <> 'googleusercontent.com'
          and server_hostname <> 'ggpht.com'
          and server_hostname <> 'googletagmanager.com'
          and server_hostname is not null
        group by subscriber, server_hostname) temp"""

    df = (sqlContext.read.format("jdbc")
          .options(url=url, dbtable=query, **properties)
          .load())
    # show() returns None, so it is called separately to keep df usable.
    df.show()
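
For illustration, here is a minimal sketch of how the `dbtable` string could be assembled so that the quoting stays manageable. It assumes the statement is handed to Spark as a parenthesized, aliased subquery, in which case the database evaluates the OR, NOT LIKE and GROUP BY parts itself and Spark's own pushdown rules do not come into play for them. The helper names `build_dbtable` and `not_equal_clauses` are hypothetical and not part of Spark or Vertica, and the query is shortened to a few of the original conditions.

```python
# Hypothetical helpers for illustration only; not part of any Spark or Vertica API.

def build_dbtable(select_sql, alias="temp"):
    """Wrap a full SELECT in parentheses with an alias, for use as `dbtable`."""
    # A trailing semicolon is not valid inside a subquery, so drop it.
    cleaned = select_sql.strip().rstrip(";").strip()
    return "({0}) {1}".format(cleaned, alias)


def not_equal_clauses(column, values):
    """Build "column <> 'v1' and column <> 'v2' ..." with single quotes doubled."""
    parts = []
    for value in values:
        escaped = value.replace("'", "''")
        parts.append("{0} <> '{1}'".format(column, escaped))
    return " and ".join(parts)


# A shortened subset of the hostnames excluded in the question.
excluded_hosts = ["doubleclick.net", "youtube.com", "gstatic.com"]

query = " ".join([
    "select subscriber, server_hostname, sum(bytes_in) DL, sum(bytes_out) UL",
    "from traffic.stats",
    "where subscriber like '41601%'",
    "and (signature_service_name = 'SSL v3'",
    "or signature_service_name = 'HTTP2 over TLS')",
    "and " + not_equal_clauses("server_hostname", excluded_hosts),
    "group by subscriber, server_hostname",
])

dbtable = build_dbtable(query)
print(dbtable)

# With a live cluster, the string would be passed the same way as in the question:
# df = sqlContext.read.format("jdbc").options(url=url, dbtable=dbtable, **properties).load()
```

Building the string in plain Python first also makes it easy to print the exact SQL and paste it into a Vertica client to check it before involving Spark.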

pyspark SQL logical conjunction

0 answers: