Question

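I am trying to load data from an S3 bucket into a Redshift table. The table has a source ID column, and I want to store in that column the name of the folder that each source file comes from.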

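I have multiple folders in the S3 bucket, each containing one file, and I use the COPY command in Redshift to load all of the files into the same table. To identify which folder each row came from, I need to store the folder name alongside the data. The table has a separate column, Source id, for this purpose.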

Can anyone help me?

Answer 1

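If you are using the Redshift COPY command, there is no option other than running the import once per folder (for example, into a temporary table) and then manually setting the value to the folder you loaded from. Repeat for each folder.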
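As a rough sketch of that per-folder approach, the statements below load one folder into a staging table and then tag the rows with the folder name. Every name here is a hypothetical placeholder rather than something from the question: the tables `stage_sales` and `sales`, the two data columns, the bucket `my-bucket`, the folder `folder_a`, and the IAM role ARN. The `delimiter` setting is also an assumption about the file format.

```sql
-- Staging table holds only the columns present in the files (no source_id).
create temp table stage_sales (
    salesid integer,
    qtysold smallint
);

-- Load every file under one folder prefix (placeholder bucket, folder, and role).
copy stage_sales
from 's3://my-bucket/folder_a/'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
delimiter '|';

-- Copy the rows into the target table, setting source_id to the folder name.
insert into sales (salesid, qtysold, source_id)
select salesid, qtysold, 'folder_a'
from stage_sales;

-- Empty the staging table before loading the next folder.
truncate stage_sales;
```

The COPY and INSERT steps would then be repeated with the next folder prefix and the matching literal for `source_id`.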

Another option is to use Redshift Spectrum and create an external table that maps your folders as partitions.

First, create the base table like this:

create external table spectrum.sales_part(
salesid integer,
listid integer,
sellerid integer,
buyerid integer,
eventid integer,
dateid smallint,
qtysold smallint,
pricepaid decimal(8,2),
commission decimal(8,2),
saletime timestamp)
partitioned by (saledate date)
row format delimited
fields terminated by '|'
stored as textfile
location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/'
table properties ('numRows'='172000');
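
The statement above places the table in an external schema named spectrum, so that schema has to exist first. A minimal sketch of creating it, assuming an AWS Glue or Athena data catalog; the catalog database name `spectrumdb` and the IAM role ARN are placeholders, not values from the answer:

```sql
-- Placeholder catalog database and IAM role; substitute your own.
create external schema if not exists spectrum
from data catalog
database 'spectrumdb'
iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'
create external database if not exists;
```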

Then add the partitions like this:

alter table spectrum.sales_part
add partition(saledate='2008-01-01') 
location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-01/';
alter table spectrum.sales_part
add partition(saledate='2008-02-01') 
location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-02/';
alter table spectrum.sales_part
add partition(saledate='2008-03-01') 
location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-03/';

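Once it is set up as an external table, you can use standard SQL against it. For example, you can run queries on the table or use CTAS to copy it into a permanent Redshift table.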
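
For instance, because the partition column is queryable like any other column, it can serve as the folder identifier when copying the data. The sketch below uses the `spectrum.sales_part` table defined earlier; the target table name `sales_with_source` and the alias `source_id` are illustrative assumptions:

```sql
-- Count rows per partition (one partition per S3 folder).
select saledate, count(*) as row_count
from spectrum.sales_part
group by saledate;

-- CTAS into a permanent Redshift table, keeping the partition value
-- as the source identifier for each row.
create table sales_with_source as
select salesid, listid, sellerid, buyerid, eventid, dateid,
       qtysold, pricepaid, commission, saletime,
       saledate as source_id
from spectrum.sales_part;
```

In the question's scenario the partition column could hold the folder name itself (for example a `varchar` partition key) instead of a date.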

Here is a link to the documentation: https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html
