Question

I am trying to create a table from partitioned data stored in Amazon S3, working from a Databricks cluster. My data is partitioned on the following:

id, report, and date

So I mounted the data:

%python
ACCESS_KEY = "xxxxxxxxx"
SecretKey = "xxxxxxxxxx"
ENCODED_SECRET_KEY = SecretKey.replace("/", "%2F")
AWS_BUCKET_NAME = "path/parent_directory"
MOUNT_NAME = "parent"
dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, 
AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)

Given that structure, the path to my data looks like this:

/dbfs/parent/id/report/date

Now I want to create a table based on the partitions. I want to put a where condition in the create table statement that filters on report_name. There are 5 reports in the id folder. My query is something like this:

%sql
Create table if not exists abc
(col1 string,
 col2 string,
 col3 bigint)using parquet
OPTIONS (path "/mnt/parent/")
partitioned by (id,report,date) where 
report="report1" ;

I am getting a syntax error:

Error in SQL statement: ParseException:mismatched input 'where' expecting <EOF>

I also tried a variation on this statement, without success.

Can anyone help me with this? Or can anyone help me load the data through spark-shell?

Thanks

Answer 1

I think what you really want is an unmanaged table over the data, plus a view that filters on that partition condition.

create table report
using parquet
options (
  path '/mnt/parent'
);

msck repair table report;

create or replace view report1
as select * from report where report = 'report1';
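
The same filtering can probably also be done with the DataFrame API, which may help if you are loading from a shell rather than a %sql cell. The following is only a minimal sketch, not a tested solution: `load_report` is a hypothetical helper name, `spark` is assumed to be the SparkSession that Databricks notebooks and the PySpark shell predefine, and the directories under /mnt/parent are assumed to use the key=value layout (for example report=report1) that Spark's partition discovery needs in order to expose `report` as a column.

```python
def load_report(spark, base_path, report_name):
    """Read partitioned Parquet data and keep the rows of one report.

    Assumes the directories under base_path are named key=value
    (e.g. id=.../report=.../date=...), so that Spark's partition
    discovery adds `report` as a column. `load_report` is a
    hypothetical helper, not part of any library.
    """
    df = spark.read.parquet(base_path)
    # Filtering on a partition column lets Spark skip the directories
    # that belong to the other reports.
    return df.where(df["report"] == report_name)


# Usage in a notebook or shell where `spark` is already defined:
#   report1_df = load_report(spark, "/mnt/parent", "report1")
#   report1_df.createOrReplaceTempView("report1")
```

If the folders are plain values without the key= prefix, Spark will not infer the partition columns, and neither this sketch nor msck repair table is likely to pick up the partitions as written.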

Create a table from partitioned Parquet data with a condition