Reading data from Vertica with a query in PySpark

Asked: 2019-02-14 20:22:58

Tags: apache-spark pyspark pyspark-sql vertica

I have Spark 1.6 running on Python 3.4, retrieving data from a Vertica database to process with the query below. Spark DataFrames support predicate pushdown for JDBC sources, but "predicate" is meant in the strict SQL sense: it covers only the WHERE clause, and it seems limited to logical conjunction (no IN and OR, I'm afraid) and simple predicates. My attempt fails with this error: java.lang.RuntimeException: Option 'dbtable' not specified

The database holds a massive amount of data, around 100 billion rows, and I cannot retrieve it. Spark 1.6 does not let me pass a query, only dbtable as schema.table, and I get the error below:

java.lang.RuntimeException: Option 'dbtable' not specified

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(conf=conf)  # conf (a SparkConf) is defined earlier, elided here
sqlContext = SQLContext(sc)

url = "*******"
properties = {"user": "*****", "password": "*******", "driver": "com.vertica.jdbc.Driver" }

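# Note: Spark 1.6's JDBC source only accepts 'dbtable'; it has no 'query'
# option (that arrived in later Spark versions), hence the error above.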
df = sqlContext.read.format("JDBC").options(
    url = url,
    query = "SELECT date(time_stamp) AS DATE, (subscriber) AS IMSI, (server_hostname) AS WEBSITE, (bytes_in) AS DOWNLINK, (bytes_out) AS UPLINK,(connections_out) AS CONNECTION FROM traffic.stats WHERE DATE(time_stamp) between '2019-01-25' AND '2019-01-29'",
    **properties
).load()

df.show()

I tried the following query instead, but it returned no results and took a very long time even with a limit applied:

query = "SELECT date(time_stamp) AS DATE, (subscriber) AS IMSI, (server_hostname) AS WEBSITE, (bytes_in) AS DOWNLINK, (bytes_out) AS UPLINK,(connections_out) AS CONNECTION FROM traffic.stats WHERE date(time_stamp) between '2019-01-27' AND '2019-01-29'"
df = sqlContext.read.format("JDBC").options(
    url = url,
    dbtable="( " + query + " ) as temp",
    **properties
).load()

Is it possible to read the data as above, or to read it into a DataFrame with a specific query?
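For reference, Spark's JDBC source can also split a read across executors with the documented partitionColumn, lowerBound, upperBound, and numPartitions options, which matters at this data volume. A minimal sketch, assuming a hypothetical integral column stat_id exists to partition on (Spark 1.6 requires a numeric partition column):

# Hedged sketch: parallel JDBC read over a hypothetical numeric column.
query = ("SELECT date(time_stamp) AS DATE, subscriber AS IMSI, "
         "server_hostname AS WEBSITE, bytes_in AS DOWNLINK, "
         "bytes_out AS UPLINK, connections_out AS CONNECTION "
         "FROM traffic.stats "
         "WHERE date(time_stamp) BETWEEN '2019-01-25' AND '2019-01-29'")

df = sqlContext.read.format("jdbc").options(
    url=url,
    dbtable="( " + query + " ) AS temp",  # a subquery must carry an alias
    partitionColumn="stat_id",  # hypothetical: use a real integral column
    lowerBound="1",             # min value of the partition column
    upperBound="100000000",     # max value of the partition column
    numPartitions="100",        # number of parallel JDBC connections
    **properties
).load()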

I tried to reduce the time by adding more conditions and a limit, but it rejected the query at $conditions, and even after removing the conditions it gave me "Subquery in FROM must have an alias". This is the query:

SELECT
  min(date(time_stamp)) AS mindate,
  max(date(time_stamp)) AS maxdate,
  count(DISTINCT date(time_stamp)) AS noofdays,
  subscriber AS IMSI,
  server_hostname AS WEBSITE,
  sum(bytes_in) AS DL,
  sum(bytes_out) AS UL,
  sum(connections_out) AS conn
FROM traffic.stats
WHERE subscriber LIKE '41601%'
  AND date(time_stamp) BETWEEN '2019-01-25' AND '2019-01-29'
  AND signature_service_category = 'Web Browsing'
  AND (signature_service_name = 'SSL v3' OR signature_service_name = 'HTTP2 over TLS')
  AND server_hostname NOT LIKE '%.googleapis.%'
  AND server_hostname NOT LIKE '%.google.%'
  AND server_hostname <> 'doubleclick.net'
  AND server_hostname <> 'youtube.com'
  AND server_hostname <> 'googleadservices.com'
  AND server_hostname <> 'app-measurement.com'
  AND server_hostname <> 'gstatic.com'
  AND server_hostname <> 'googlesyndication.com'
  AND server_hostname <> 'google-analytics.com'
  AND server_hostname <> 'googleusercontent.com'
  AND server_hostname <> 'ggpht.com'
  AND server_hostname <> 'googletagmanager.com'
  AND server_hostname IS NOT NULL
GROUP BY subscriber, server_hostname
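"Subquery in FROM must have an alias" is Vertica saying that the derived table passed through dbtable needs an AS alias, the same pattern as in the workaround above. A minimal sketch under that assumption, with the WHERE clause abbreviated for space:

# Hedged sketch: give the derived table an alias so Vertica accepts it.
agg_query = """
SELECT min(date(time_stamp)) AS mindate, max(date(time_stamp)) AS maxdate,
       count(DISTINCT date(time_stamp)) AS noofdays, subscriber AS IMSI,
       server_hostname AS WEBSITE, sum(bytes_in) AS DL,
       sum(bytes_out) AS UL, sum(connections_out) AS conn
FROM traffic.stats
WHERE subscriber LIKE '41601%'
  AND date(time_stamp) BETWEEN '2019-01-25' AND '2019-01-29'
GROUP BY subscriber, server_hostname
"""  # the remaining hostname filters from the full query apply here too

df = sqlContext.read.format("jdbc").options(
    url=url,
    dbtable="( " + agg_query + " ) AS temp",  # the alias fixes the error
    **properties
).load()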

1 Answer:

Answer 0 (score: 1):

If the query is taking over an hour just to filter between a date range, you should consider writing a projection.

CREATE PROJECTION traffic.status_date_range
(
  time_stamp,
  subscriber,
  server_hostname,
  bytes_in,
  bytes_out,
  connections_out
)
AS
  SELECT
    time_stamp,
    subscriber,
    server_hostname,
    bytes_in,
    bytes_out,
    connections_out
  FROM traffic.stats
  ORDER BY time_stamp
SEGMENTED BY HASH(time_stamp) ALL NODES;

Creating query-specific projections like this can add a significant amount of disk usage, but if performance is truly important to you, it may be worth it.

If you haven't already, I would also recommend partitioning the table. Depending on how many distinct dates there are in your traffic.stats table, you may not want to partition by day. Each partition creates at least one ROS container (sometimes more), so if you have 1024 or more distinct dates, Vertica won't even let you partition by date; in that case you can partition by month instead. If you are using Vertica 9, you can take advantage of hierarchical partitioning (you can read about that here), as sketched below.
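As an illustration only, a sketch of both options. The ALTER TABLE ... PARTITION BY ... REORGANIZE syntax is Vertica's; the month expression and the CALENDAR_HIERARCHY_DAY arguments are assumptions about sensible settings for this table:

-- Sketch: partition traffic.stats by month instead of by day.
ALTER TABLE traffic.stats
  PARTITION BY EXTRACT(YEAR FROM time_stamp) * 100 + EXTRACT(MONTH FROM time_stamp)
  REORGANIZE;

-- Vertica 9 alternative: hierarchical partitioning keeps recent days as
-- daily partitions and rolls older ones up into months and years
-- (here, 2 active months and 2 active years; tune for your retention).
ALTER TABLE traffic.stats
  PARTITION BY time_stamp::DATE
  GROUP BY CALENDAR_HIERARCHY_DAY(time_stamp::DATE, 2, 2)
  REORGANIZE;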

One caution: reorganizing the table after running the ALTER TABLE statement that adds the partition clause will consume a lot of disk space, because Vertica rewrites the data into new files. Once it finishes, the table will occupy roughly the same space it does now, but while the partitioning runs, your disk usage can grow considerably.