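I am running Spark 1.6 on 3 VMs (1 master node and 2 slave nodes) and retrieving data from my Vertica database in order to process it. The query below runs fine directly on the Vertica DB, but I am having trouble running it from pyspark.

From what I have read, Spark DataFrames support predicate push-down with JDBC sources, but the term "predicate" is used in its strict SQL meaning: it covers only the WHERE clause. Moreover, it looks like it is limited to logical conjunction (no IN and OR, I am afraid) and simple predicates.

Is there any way to execute the following query as-is?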
url = "*******"
properties = {
"user": "*****",
"password": "*******",
"driver": "com.vertica.jdbc.Driver"
}
# Pass the whole aggregation to Vertica as a parenthesized subquery in dbtable.
# A triple-quoted string is used so the single quotes inside the SQL do not end the Python string.
# The Sqoop-style $CONDITIONS placeholder is dropped: Spark's JDBC reader does not substitute it.
query = """(SELECT min(date(time_stamp)) mindate, max(date(time_stamp)) maxdate,
       count(distinct date(time_stamp)) noofdays, subscriber, server_hostname,
       sum(bytes_in) DL, sum(bytes_out) UL, sum(connections_out) conn
FROM traffic.stats
WHERE subscriber LIKE '41601%'
  AND date(time_stamp) BETWEEN '2019-01-25' AND '2019-01-29'
  AND signature_service_category = 'Web Browsing'
  AND (signature_service_name = 'SSL v3' OR signature_service_name = 'HTTP2 over TLS')
  AND server_hostname NOT LIKE '%.googleapis.%'
  AND server_hostname NOT LIKE '%.google.%'
  AND server_hostname <> 'doubleclick.net' AND server_hostname <> 'youtube.com'
  AND server_hostname <> 'googleadservices.com' AND server_hostname <> 'app-measurement.com'
  AND server_hostname <> 'gstatic.com' AND server_hostname <> 'googlesyndication.com'
  AND server_hostname <> 'google-analytics.com' AND server_hostname <> 'googleusercontent.com'
  AND server_hostname <> 'ggpht.com' AND server_hostname <> 'googletagmanager.com'
  AND server_hostname IS NOT NULL
GROUP BY subscriber, server_hostname) temp"""

df = (sqlContext.read.format("jdbc")
      .options(url=url, dbtable=query, **properties)
      .load())
# Call show() separately: it returns None, so chaining it would leave df empty.
df.show()
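
For reference, here is a minimal sketch of the pattern the snippet above relies on: instead of counting on predicate push-down, the full SELECT is wrapped in parentheses with an alias and handed to the JDBC reader as `dbtable`, so the database evaluates the whole query. The helper names `wrap_as_subquery` and `load_query` are hypothetical and only for illustration; the sketch assumes `sql_context` is an existing `SQLContext` and that `url` and `properties` are defined as above.

```python
def wrap_as_subquery(query, alias="temp"):
    """Wrap a full SELECT so it can be passed as the JDBC `dbtable` option.

    Hypothetical helper, not part of Spark.
    """
    # Drop surrounding whitespace and a trailing semicolon,
    # since a semicolon is not valid inside a parenthesized subquery.
    body = query.strip().rstrip(";").strip()
    return "({0}) {1}".format(body, alias)


def load_query(sql_context, url, query, properties):
    """Have the database run `query` and return the result as a DataFrame.

    Hypothetical helper; `sql_context` is assumed to be an existing SQLContext.
    """
    return (sql_context.read.format("jdbc")
            .options(url=url, dbtable=wrap_as_subquery(query), **properties)
            .load())
```

With this, something like `load_query(sqlContext, url, "SELECT subscriber, sum(bytes_in) DL FROM traffic.stats GROUP BY subscriber", properties).show()` would be the intended usage. Spark then only sees the already aggregated rows, and any further filtering on the DataFrame happens on top of that result.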