I have Spark 1.6 running on Python 3.4, retrieving data from a Vertica database to process with the query below. Spark DataFrames support predicate pushdown for JDBC sources, but "predicate" is used in the strict SQL meaning: it covers only the WHERE clause, and it appears to be limited to logical conjunctions (no IN or OR, I'm afraid) and simple predicates. It shows this error: java.lang.RuntimeException: Option 'dbtable' not specified
The database holds a massive amount of data (around 100 billion rows), and I cannot retrieve it. Spark 1.6 will not let me pass a query as dbtable, only a schema.table name, and I get the following error:
java.lang.RuntimeException: Option 'dbtable' not specified
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

url = "*******"
properties = {"user": "*****", "password": "*******", "driver": "com.vertica.jdbc.Driver"}

df = sqlContext.read.format("jdbc").options(
    url=url,
    # "query" is not a supported JDBC option in Spark 1.6, hence the dbtable error
    query="SELECT date(time_stamp) AS DATE, subscriber AS IMSI, server_hostname AS WEBSITE, bytes_in AS DOWNLINK, bytes_out AS UPLINK, connections_out AS CONNECTION FROM traffic.stats WHERE date(time_stamp) BETWEEN '2019-01-25' AND '2019-01-29'",
    **properties
).load()
df.show()
I tried the following query, but got no result; without a LIMIT it takes a very long time:
query = "SELECT date(time_stamp) AS DATE, subscriber AS IMSI, server_hostname AS WEBSITE, bytes_in AS DOWNLINK, bytes_out AS UPLINK, connections_out AS CONNECTION FROM traffic.stats WHERE date(time_stamp) BETWEEN '2019-01-27' AND '2019-01-29'"

df = sqlContext.read.format("jdbc").options(
    url=url,
    dbtable="( " + query + " ) AS temp",
    **properties
).load()
Is it possible to read the data as above, or to read it into a DataFrame from a specific query?
I tried to reduce the runtime by adding more conditions and a limit, but it rejected `$conditions`, and even after removing the condition it gave me "Subquery in FROM must have an alias". This is the query:
SELECT
    min(date(time_stamp)) AS mindate,
    max(date(time_stamp)) AS maxdate,
    count(DISTINCT date(time_stamp)) AS noofdays,
    subscriber AS IMSI,
    server_hostname AS WEBSITE,
    sum(bytes_in) AS DL,
    sum(bytes_out) AS UL,
    sum(connections_out) AS conn
FROM traffic.stats
WHERE subscriber LIKE '41601%'
  AND date(time_stamp) BETWEEN '2019-01-25' AND '2019-01-29'
  AND signature_service_category = 'Web Browsing'
  AND (signature_service_name = 'SSL v3' OR signature_service_name = 'HTTP2 over TLS')
  AND server_hostname NOT LIKE '%.googleapis.%'
  AND server_hostname NOT LIKE '%.google.%'
  AND server_hostname NOT IN (
      'doubleclick.net', 'youtube.com', 'googleadservices.com',
      'app-measurement.com', 'gstatic.com', 'googlesyndication.com',
      'google-analytics.com', 'googleusercontent.com', 'ggpht.com',
      'googletagmanager.com'
  )
  AND server_hostname IS NOT NULL
GROUP BY subscriber, server_hostname
Answer (score: 1):
If the query takes more than an hour just to filter on the date range, you should consider writing a projection.
CREATE PROJECTION traffic.status_date_range
(
    time_stamp,
    subscriber,
    server_hostname,
    bytes_in,
    bytes_out,
    connections_out
)
AS
SELECT
    time_stamp,
    subscriber,
    server_hostname,
    bytes_in,
    bytes_out,
    connections_out
FROM traffic.stats
ORDER BY time_stamp
SEGMENTED BY HASH(time_stamp) ALL NODES;
Creating query-specific projections like this can consume a lot of additional disk space, but if performance really matters to you, it may well be worth it.
If you haven't already, I would also recommend partitioning the table. Depending on how many distinct dates are in your traffic.stats table, you may not want to partition by day. Each partition creates at least one ROS container (sometimes more), so if you have 1024 or more distinct dates, Vertica won't even let you partition by date; in that case you could partition by month instead. If you are on Vertica 9, you can take advantage of hierarchical partitioning (you can read about it here).
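As a sketch of the monthly-partitioning step described above (the partition expression is an assumption of a common YYYYMM scheme; verify the syntax against your Vertica version):

```sql
-- Monthly partitioning using an integer YYYYMM partition key.
-- REORGANIZE rewrites existing data into new ROS containers,
-- so expect a temporary spike in disk usage while it runs.
ALTER TABLE traffic.stats
  PARTITION BY EXTRACT(YEAR FROM time_stamp) * 100 + EXTRACT(MONTH FROM time_stamp)
  REORGANIZE;
```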
I would caution that reorganizing the table after running an ALTER TABLE statement to add a partition clause will require a lot of disk space, because Vertica writes the data out to new files. Once it finishes, the table will occupy roughly the same space it does now, but while the partitioning is in progress your disk usage may grow considerably.
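As for the "Option 'dbtable' not specified" error on the Spark side: Spark 1.6's JDBC reader has no "query" option, so the SQL has to be wrapped in a parenthesized subquery with an alias and passed as dbtable. A minimal sketch (the URL, credentials, and query below are placeholders, not working values):

```python
def vertica_jdbc_options(query, url, user, password):
    """Build the options dict for Spark 1.6's JDBC reader from raw SQL."""
    return {
        "url": url,
        # the alias avoids "Subquery in FROM must have an alias"
        "dbtable": "({}) AS subq".format(query),
        "user": user,
        "password": password,
        "driver": "com.vertica.jdbc.Driver",
    }

opts = vertica_jdbc_options(
    query="SELECT date(time_stamp) AS dt, subscriber AS imsi "
          "FROM traffic.stats "
          "WHERE date(time_stamp) BETWEEN '2019-01-27' AND '2019-01-29'",
    url="jdbc:vertica://host:5433/db",  # placeholder
    user="user", password="pw",         # placeholders
)

# With a live cluster and a SQLContext:
# df = sqlContext.read.format("jdbc").options(**opts).load()
```

Keeping the option-building separate from `.load()` also makes it easy to check the generated derived-table string before touching the database.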