Using Spark 1.6, I am retrieving data from my Vertica database in order to process it. The query below runs fine directly against the Vertica DB, but it does not work through PySpark. Spark DataFrames do support predicate pushdown for JDBC sources, but the term "predicate" is used in its strict SQL sense: it only covers the WHERE clause, and it appears to be limited to conjunctions of simple predicates (no IN or OR, I'm afraid); a sketch of what that pushdown looks like follows the code below. The read fails with this error:

java.lang.RuntimeException: Option 'dbtable' not specified
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
        .setAppName("hivereader")
        .setMaster("yarn-client")
        .set("spark.dynamicAllocation.enabled", "false")
        .set("spark.shuffle.service.enabled", "false")
        .set("spark.io.compression.codec", "snappy")
        .set("spark.rdd.compress", "true")
        .set("spark.executor.cores", 7)
        .set("spark.sql.inMemoryStorage.compressed", "true")
        .set("spark.sql.shuffle.partitions", 2000)
        .set("spark.sql.tungsten.enabled", "true")
        .set("spark.port.maxRetries", 200))

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

url = "*******"
properties = {"user": "*****", "password": "*******", "driver": "com.vertica.jdbc.Driver"}

df = sqlContext.read.format("JDBC").options(
    url=url,
    query="SELECT date(time_stamp) AS DATE, (subscriber) AS IMSI, (server_hostname) AS WEBSITE, (bytes_in) AS DOWNLINK, (bytes_out) AS UPLINK, (connections_out) AS CONNECTION FROM traffic.stats WHERE DATE(time_stamp) between '2019-01-25' AND '2019-01-29'",
    **properties
).load()  # raises java.lang.RuntimeException: Option 'dbtable' not specified
df.show()
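(Regarding the pushdown remark above: below is a minimal sketch of relying on Spark's own pushdown instead of embedding the WHERE clause in the query string. It assumes the same url and properties, that traffic.stats can be read as a plain table, and that comparing the raw time_stamp column is an acceptable substitute for DATE(time_stamp); the column list and date bounds are illustrative, not from the original post.)

raw = sqlContext.read.format("jdbc").options(
    url=url,
    dbtable="traffic.stats",  # a plain table name, so the 'dbtable' error does not occur
    **properties
).load()

# Simple column comparisons are candidates for pushdown into the generated WHERE clause;
# expressions such as DATE(time_stamp) are not. Whether a given filter was actually
# pushed shows up under PushedFilters in the physical plan.
filtered = (raw
            .filter((raw.time_stamp >= "2019-01-25") & (raw.time_stamp < "2019-01-30"))
            .select("time_stamp", "subscriber", "server_hostname",
                    "bytes_in", "bytes_out", "connections_out"))
filtered.explain()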
Answer:
The problem is that, even though you say this query works against Vertica, your query is not written in SQL syntax that Vertica recognizes. Your query should be rewritten as:
SELECT date(time_stamp) AS DATE, (subscriber) AS IMSI, (server_hostname) AS WEBSITE, (bytes_in) AS DOWNLINK, (bytes_out) AS UPLINK, (connections_out) AS CONNECTION
FROM traffic.stats
WHERE DATE(time_stamp) BETWEEN '2019-01-25' AND '2019-01-29'
After fixing those errors, your sqlContext.read call should look like this:
df = sqlContext.read.format("JDBC").options(
    url=url,
    query="SELECT date(time_stamp) AS DATE, (subscriber) AS IMSI, (server_hostname) AS WEBSITE, (bytes_in) AS DOWNLINK, (bytes_out) AS UPLINK, (connections_out) AS CONNECTION FROM traffic.stats WHERE DATE(time_stamp) between '2019-01-25' AND '2019-01-29'",
    **properties
).load()
df.show()
Alternatively, you can wrap the query in a subquery alias and pass it through dbtable instead of query:
df = sqlContext.read.format("JDBC").options(
    url=url,
    dbtable="(SELECT date(time_stamp) AS DATE, (subscriber) AS IMSI, (server_hostname) AS WEBSITE, (bytes_in) AS DOWNLINK, (bytes_out) AS UPLINK, (connections_out) AS CONNECTION FROM traffic.stats WHERE DATE(time_stamp) between '2019-01-25' AND '2019-01-29') temp",
    **properties
).load()
df.show()
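Once the dbtable subquery loads, the returned DataFrame can be processed like any other. A minimal, purely illustrative follow-up sketch (the temp table name and the aggregation are assumptions, not part of the original post):

# Register the loaded DataFrame and aggregate it with Spark SQL (Spark 1.6 API).
df.registerTempTable("traffic_stats")
top_sites = sqlContext.sql(
    "SELECT WEBSITE, SUM(DOWNLINK) AS TOTAL_DOWNLINK, SUM(UPLINK) AS TOTAL_UPLINK "
    "FROM traffic_stats GROUP BY WEBSITE ORDER BY SUM(DOWNLINK) DESC"
)
top_sites.show(20)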