Question

I have many objects under an S3 path that I am trying to crawl (using the root path s3://my-bucket/somedata/):

s3://my-bucket/somedata/20180101/data1/stuff.txt.gz
s3://my-bucket/somedata/20180101/data2/stuff.txt.gz
s3://my-bucket/somedata/20180101/data1.sql
s3://my-bucket/somedata/20180101/data2.sql  
s3://my-bucket/somedata/20180102/data1/stuff.txt.gz
s3://my-bucket/somedata/20180102/data2/stuff.txt.gz
...

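Sometimes our tables are named after the date pattern (e.g. 20180101), sometimes after the leaf-level folder (e.g. data1), and sometimes after the file (e.g. data1.sql). When there is a conflict, Glue seems to simply append a unique identifier to the table name (e.g. data1_c17b2f988649f2171b24b1d35da7f2b4).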

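What is the logic here? Are these names deterministic? Should I structure my data according to some pattern so that the crawler catalogs things in a logical order?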

Answer 1

You need to standardize your paths to get the names right, for example:

s3://my-bucket/Customer/Customer_20180101/customer.csv 
s3://my-bucket/Customer/Customer_20180102/customer.csv 
s3://my-bucket/Customer/Customer_20180103/customer.csv 
s3://my-bucket/Customer/Customer_20180104/customer.csv 
s3://my-bucket/Customer/Customer_20180105/customer.csv

Once you point the crawler at the Customer folder on S3, the Glue crawler will load all of these files into a single Customer table.
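As a rough way to check whether a folder follows a standardized layout like the one above before crawling it, here is a minimal Python sketch. It does not reproduce Glue's actual table-detection logic, which is not described in this thread. It only groups object keys by folder depth and file extension, on the assumption that a uniform layout is what "standardized" means here. The helper names `summarize_layout` and `is_uniform` are made up for illustration, and the keys are bucket-relative versions of the paths in the question and answer.

```python
from collections import defaultdict
from posixpath import splitext


def summarize_layout(keys, table_root):
    """Group object keys under table_root by (folder depth, file extension).

    This is only a heuristic for spotting mixed layouts; it is NOT how
    the Glue crawler itself decides on table names.
    """
    root = table_root.rstrip("/") + "/"
    shapes = defaultdict(list)
    for key in keys:
        if not key.startswith(root):
            continue  # ignore objects outside the candidate table folder
        relative = key[len(root):]
        depth = relative.count("/")  # folder levels between root and file
        extension = splitext(relative)[1]  # last suffix only, e.g. ".gz"
        shapes[(depth, extension)].append(key)
    return dict(shapes)


def is_uniform(keys, table_root):
    """Return True when every object shares one depth and one extension."""
    return len(summarize_layout(keys, table_root)) == 1


# Layout from the answer: one file per dated folder, same name and type.
customer_keys = [
    "Customer/Customer_20180101/customer.csv",
    "Customer/Customer_20180102/customer.csv",
]

# Layout from the question: .sql files sit next to data folders.
mixed_keys = [
    "somedata/20180101/data1/stuff.txt.gz",
    "somedata/20180101/data2/stuff.txt.gz",
    "somedata/20180101/data1.sql",
    "somedata/20180101/data2.sql",
]

print(is_uniform(customer_keys, "Customer"))
print(is_uniform(mixed_keys, "somedata"))
```

Under these assumptions the Customer layout comes out as uniform, while the layout from the question does not, because the .sql files sit one level above the .txt.gz files and have a different extension.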
