PySpark logical conjunction of Vertica SQL

Date: 2019-02-13 19:48:21

Tags: sql apache-spark pyspark vertica

Spark 1.6, retrieving data from my Vertica database in order to process it. The following query runs fine on the Vertica DB itself, but it does not work from PySpark. Spark DataFrames support predicate pushdown for JDBC sources, but the term predicate is used in the strict SQL sense: it covers only the WHERE clause, and it appears to be limited to logical conjunctions (no IN and OR, I am afraid) and simple predicates. The read fails with this error: java.lang.RuntimeException: Option 'dbtable' is not specified

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
    .setAppName("hivereader")
    .setMaster("yarn-client")
    .set("spark.dynamicAllocation.enabled", "false")
    .set("spark.shuffle.service.enabled", "false")
    .set("spark.io.compression.codec", "snappy")
    .set("spark.rdd.compress", "true")
    .set("spark.executor.cores", "7")
    .set("spark.sql.inMemoryStorage.compressed", "true")
    .set("spark.sql.shuffle.partitions", "2000")
    .set("spark.sql.tungsten.enabled", "true")
    .set("spark.port.maxRetries", "200"))

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

url = "*******"
properties = {"user": "*****", "password": "*******", "driver": "com.vertica.jdbc.Driver" }

df = sqlContext.read.format("JDBC").options(
    url = url,
    query = "SELECT date(time_stamp) AS DATE, (subscriber) AS IMSI, (server_hostname) AS WEBSITE, (bytes_in) AS DOWNLINK, (bytes_out) AS UPLINK,(connections_out) AS CONNECTION FROM traffic.stats WHERE DATE(time_stamp) between '2019-01-25' AND '2019-01-29'",
    **properties
).load()

df.show()
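To make the pushdown limitation above concrete: a JDBC source is handed a table (or subquery) via dbtable, and Spark then appends the filters it can push as an AND-joined WHERE clause on the generated SELECT. The helper below is a rough, dependency-free sketch of that idea only (the function name and output shape are my own illustration, not Spark's actual code generator):

```python
# Sketch of how a JDBC source combines a dbtable value with pushed-down
# predicates to form the SQL actually sent to the database. Illustration
# only -- not Spark's real code generator.

def jdbc_query(dbtable, filters=()):
    """Build the SELECT a JDBC reader would issue for `dbtable`.

    `filters` are simple predicates joined with AND, mirroring the
    limitation that pushdown covers only WHERE-clause conjunctions
    (no OR / IN).
    """
    sql = "SELECT * FROM {}".format(dbtable)
    if filters:
        sql += " WHERE " + " AND ".join(filters)
    return sql

subquery = ("(SELECT date(time_stamp) AS DATE, subscriber AS IMSI "
            "FROM traffic.stats) temp")
print(jdbc_query(subquery, ["DATE >= '2019-01-25'", "DATE <= '2019-01-29'"]))
```

Because only a table name or parenthesised subquery fits into that `SELECT * FROM ...` template, the reader refuses to start without a dbtable value, which is exactly the RuntimeException above.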

1 Answer:

Answer 0: (score: 0)

The problem is that, even though you say this query works against Vertica, it is not written in SQL syntax that Vertica recognizes. Your query should be rewritten as:

SELECT date(time_stamp) AS DATE, (subscriber) AS IMSI, (server_hostname) AS WEBSITE, (bytes_in) AS DOWNLINK, (bytes_out) AS UPLINK,(connections_out) AS CONNECTION
FROM traffic.stats
WHERE DATE(time_stamp) between '2019-01-25' AND '2019-01-29'

Once all of those errors are fixed, your sqlContext.read call should look like this:

df = sqlContext.read.format("JDBC").options(
    url = url,
    query = "SELECT date(time_stamp) AS DATE, (subscriber) AS IMSI, (server_hostname) AS WEBSITE, (bytes_in) AS DOWNLINK, (bytes_out) AS UPLINK,(connections_out) AS CONNECTION FROM traffic.stats WHERE DATE(time_stamp) between '2019-01-25' AND '2019-01-29'",
    **properties
).load()

df.show()

Alternatively, you can alias the query as a subquery and use dbtable in place of query:

df = sqlContext.read.format("JDBC").options(
    url = url,
    dbtable = "(SELECT date(time_stamp) AS DATE, (subscriber) AS IMSI, (server_hostname) AS WEBSITE, (bytes_in) AS DOWNLINK, (bytes_out) AS UPLINK,(connections_out) AS CONNECTION FROM traffic.stats WHERE DATE(time_stamp) between '2019-01-25' AND '2019-01-29') temp",
    **properties
).load()

df.show()
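The dbtable variant works because Spark wraps whatever string you pass in a `SELECT * FROM <dbtable>` statement, so the value must be either a plain table name or a parenthesised subquery followed by an alias (here `temp`). That SQL shape can be checked quickly without a Vertica connection; the sketch below uses the stdlib sqlite3 module with a made-up table, purely to show that the `(subquery) alias` derived-table form filters as expected:

```python
# Verify the "(subquery) alias" derived-table form that dbtable relies on,
# using an in-memory SQLite database. The table and rows are invented for
# illustration; Vertica itself is not involved.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats (time_stamp TEXT, bytes_in INTEGER)")
conn.execute("INSERT INTO stats VALUES ('2019-01-26', 100), ('2019-02-01', 7)")

# Same shape as the dbtable value: a parenthesised subquery plus an alias.
dbtable = ("(SELECT date(time_stamp) AS DATE, bytes_in AS DOWNLINK "
           "FROM stats WHERE date(time_stamp) BETWEEN '2019-01-25' "
           "AND '2019-01-29') temp")

# Mirrors the query a JDBC reader would issue against that dbtable value.
rows = conn.execute("SELECT * FROM " + dbtable).fetchall()
print(rows)  # only the row inside the date range survives
```

Note that the date filter lives inside the subquery here, so it runs on the database side regardless of what the reader pushes down.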