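I am running Vora 1.3 on HDP 2.4.3 with Spark 1.6.2.
I have two tables holding data with the same schema: one resides in a HANA database, and the other is stored as CSV files in HDFS.
I created the two tables in Vora using Zeppelin: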
CREATE TABLE flights_2006 (Year int, Month_ int, DayofMonth int, DayOfWeek int, DepTime int, CRSDepTime int, ArrTime int, CRSArrTime int, UniqueCarrier string, FlightNum int,
TailNum string, ActualElapsedTime int, CRSElapsedTime int, AirTime int, ArrDelay int, DepDelay int, Origin string, Dest string, Distance int, TaxiIn int, TaxiOut int,
Cancelled int, CancellationCode int, Diverted int, CarrierDelay int, WeatherDelay int, NASDelay int, SecurityDelay int, LateAircraftDelay int)
USING com.sap.spark.vora
OPTIONS (
files "/exch/flights_filtered/part-00000,/exch/flights_filtered/part-00001,/exch/flights_filtered/part-00002,/exch/flights_filtered/part-00003,/exch/flights_filtered/part-00004",
csvdelimiter ","
)
Q1. By the way, when creating Vora tables from a file source, when will it be possible to provide a directory name instead of listing every file in the directory? Listing them is very impractical, because there is no way to predict how many part files a directory will contain.
CREATE TABLE flights_2007
USING com.sap.spark.hana
OPTIONS (
tablepath "XXXXXXXXXXXX",
dbschema "XXXXXXXXXX",
host "XXXXXXXXXXX",
instance "00",
user "XXXXXXXXXXX",
passwd "XXXXXXXXXX"
)
I was able to produce results from a join of these two tables (leaving aside the business sense of such a join):
select f7.MONTH, f7.DAYOFMONTH, f7.UNIQUECARRIER, f7.FLIGHTNUM, f7.YEAR, f7.DEPTIME, f6.year, f6.DepTime
from flights_2007 as f7 inner join flights_2006 as f6
on f7.MONTH = f6.Month_ and f7.DAYOFMONTH = f6.DayofMonth and f7.UNIQUECARRIER = f6.UniqueCarrier and f7.FLIGHTNUM = f6.FlightNum
where f7.MONTH = 1 and f7.DAYOFMONTH = 2 and f7.UNIQUECARRIER = 'WN'
Then I tried to do the same steps in Vora Modeler.
Q2. How come REGISTER TABLE in Zeppelin results in no tables in Vora Modeler?
So I executed the same two table creation statements in Vora Modeler, using all capital letters in the table names, as I remember Vora had some issues with this earlier. Then I created a Vora View as a join of the two tables using the following join conditions:
FLIGHTS_2007.MONTH = FLIGHTS_2006.MONTH_ and
FLIGHTS_2007.DAYOFMONTH = FLIGHTS_2007.DAYOFMONTH and
FLIGHTS_2007.UNIQUECARRIER = FLIGHTS_2006.UNIQUECARRIER and
FLIGHTS_2007.FLIGHTNUM = FLIGHTS_2006.FLIGHTNUM
.. and using this where-condition:
FLIGHTS_2007.MONTH = 1 and
FLIGHTS_2007.DAYOFMONTH = 2 and
FLIGHTS_2007.UNIQUECARRIER = 'WN'
The expected result of previewing that view was the same as for the Zeppelin-based select. The actual result (first few lines):
Schema error: Could not resolve column "FLIGHTS_2006"."YEAR"
Q3. What am I doing wrong in Vora Modeler? Or is it actually a bug?
Answer 0: (score: 1)
You mentioned that you used all caps for the table names when running your CREATE statements. In my experience with the 1.3 Modeler, you have to use all caps for the column names as well.
For example, if you used "CREATE TABLE FLIGHTS_2006 (Year int, ...", try changing it to "CREATE TABLE FLIGHTS_2006 (YEAR int, ...".
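If retyping a long column list in capitals by hand is error-prone, a small script can do it. This is only a sketch: `uppercase_columns` is a hypothetical helper, not part of Vora, and it assumes every column definition is a plain "name type" pair with no commas inside the type.

```python
def uppercase_columns(column_defs: str) -> str:
    """Upper-case the column name in each "name type" pair, leaving the type alone."""
    rewritten = []
    for definition in column_defs.split(","):
        # Split once on the first space: left part is the name, right part the type.
        name, _, col_type = definition.strip().partition(" ")
        rewritten.append(f"{name.upper()} {col_type.strip()}")
    return ", ".join(rewritten)


# A shortened version of the column list from the question.
columns = "Year int, Month_ int, DayofMonth int, UniqueCarrier string"
ddl = f"CREATE TABLE FLIGHTS_2006 ({uppercase_columns(columns)})"
print(ddl)
```

The USING and OPTIONS clauses stay exactly as in the original statement; only the column names change.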
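Answer 1: (score: 0)
Regarding your Q1: yes, this is a feature request that is currently under review.
Regarding your Q2: is your Zeppelin connected to the same Vora Thrift server as your Vora Modeler (a.k.a. Vora Tools)?
Regarding your Q3: Ryan's other reply is correct; column names in Vora 1.3 are case-sensitive as well.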
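Until directory names are supported (Q1), one possible workaround is to generate the value of the `files` option from a directory listing instead of typing it by hand. The sketch below is an assumption-laden illustration: `files_option` is a hypothetical helper, and the listing is hard-coded here, whereas in practice the paths would have to be collected from HDFS first (for example from the output of `hdfs dfs -ls`).

```python
import posixpath


def files_option(paths):
    """Return the comma-separated value for the `files` option, keeping only part files."""
    part_files = sorted(
        p for p in paths if posixpath.basename(p).startswith("part-")
    )
    return ",".join(part_files)


# Hypothetical listing of the directory used in the question.
listing = [
    "/exch/flights_filtered/_SUCCESS",
    "/exch/flights_filtered/part-00001",
    "/exch/flights_filtered/part-00000",
]
print(f'files "{files_option(listing)}",')
```

Marker files such as `_SUCCESS` are filtered out, and sorting keeps the generated statement stable between runs.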