Question

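I created an external Hive table (Hive 1.0 on EMR) stored in S3. I can successfully insert records into this table with Hive, query them back, and pull the files directly from the S3 bucket to verify. So far, so good.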

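I would like to use Pig (v0.14, also on EMR) to read from and write to this logical table. Loading with HCatLoader() works fine; dump and explain confirm that my data and schema are as expected.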

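The problem comes when I try to write with HCatStorer(). Pig reports success, with N records written but 0 bytes. I can't see anything in the logs that looks relevant or points to a problem, and no data is written to the table or bucket.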

a = load 'myfile' as (foo: int, bar: chararray); -- Just assume that this works.
dump a; -- Records are there
describe a; -- Correct schema, as specified above
store a into 'mytable' using org.apache.hive.hcatalog.pig.HCatStorer();

The output (which, again, shows no other sign of a problem) ends with this summary:

Success!

...

Input(s):
Successfully read 2 records (24235 bytes) from: "myfile"

Output(s):
Successfully stored 2 records in: "mytable"

Counters:
Total records written : 2
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

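Worth noting:

In the same environment, all of this works when the table location is in HDFS rather than S3, for both external and internal tables, and from either Hive or Pig.
I can successfully store directly to S3, e.g. store a into 's3n://mybucket/output' using PigStorage(',');
Running the same insert through the Hive shell works fine.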

So this appears to be a problem with the interaction of Pig, HCatalog, and S3 as a stack; any two of them seem to work fine together.

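Given that I see nothing useful in the Pig logs, where else should I look to debug this? Are there specific configuration parameters for any of these technologies that I should check?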

Answer 1

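I think there is a problem with writing to S3 from Pig through HCatalog. The final output data is written to a _temporary directory and is never copied or moved to the intended location. I have only run into this odd behavior on S3.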

In my case, the output should have been written to s3://x/y/, but the data ended up in s3://x/y/_temporary/attempt_1466700620679_0019_r_000000_0/part-r-00000
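To check whether the same thing is happening in your environment, a recursive listing of the table location from the Grunt shell is one option. This is only a sketch: s3://x/y/ is the placeholder path used in this answer, so substitute your table's actual location.

```
-- Recursively list the table's S3 location (placeholder path from this answer).
-- Part files left under _temporary/attempt_* suggest the job output
-- was written but never moved to its final location.
fs -ls -R s3://x/y/
```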

The workaround is to write the HCatalog output to HDFS first, and then copy it to S3.
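A rough sketch of that workaround in a Pig script follows. Everything named here is an assumption for illustration: mytable_hdfs is a hypothetical staging table with the same schema and storage format as the S3 table, /user/hive/warehouse/mytable_hdfs is assumed to be its HDFS location, and s3n://mybucket/mytable/ stands in for the real S3 table location.

```
-- Load the input as in the question.
a = load 'myfile' as (foo: int, bar: chararray);

-- Store through HCatalog into a staging table located in HDFS
-- (hypothetical table name; the question reports that HDFS-backed tables work).
store a into 'mytable_hdfs' using org.apache.hive.hcatalog.pig.HCatStorer();

-- Copy the resulting data files to the S3 table location.
-- Both paths are assumptions; check the staging table's actual location first.
fs -cp /user/hive/warehouse/mytable_hdfs/* s3n://mybucket/mytable/
```

Copying the raw files only makes sense if both tables use the same file format. For larger tables, a distributed copy tool such as S3DistCp on EMR may be a better fit than fs -cp.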

You can refer to the following thread posted on the AWS forums: https://forums.aws.amazon.com/thread.jspa?threadID=230544

Pig writes to S3 via HCatStorer() "successfully", but writes 0 bytes
