distcp fails when copying from S3 to HDFS

Time: 2018-05-26 15:56:20

Tags: amazon-web-services hadoop amazon-s3 amazon-emr

I created a cluster (Spark on Amazon EMR) and tried to run the following from the command line.

CLI:

  

hadoop distcp s3a://bucket/file1 /data

Exception:

DECIMAL(5,3)

1 answer:

Answer 0 (score: 0)

Check that /etc/hadoop/conf/yarn-site.xml contains the following properties:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>

<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>

<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
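
One way to check the file without reading it by eye is a short script. The following is only a sketch: the helper names (read_properties, missing_shuffle_settings) are made up for illustration, and it assumes the file is a standard Hadoop <configuration> document at the path given above. It only reports which of the settings listed here are absent from yarn-site.xml; it does not change anything.

```python
import os
import xml.etree.ElementTree as ET

# Path taken from the answer; adjust if your cluster keeps its config elsewhere.
CONF_PATH = "/etc/hadoop/conf/yarn-site.xml"
AUX_SERVICES_KEY = "yarn.nodemanager.aux-services"
REQUIRED_SERVICE = "mapreduce_shuffle"


def read_properties(root):
    """Return a dict of name -> value for every <property> under root."""
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def missing_shuffle_settings(root):
    """List which mapreduce_shuffle settings are absent from this file."""
    props = read_properties(root)
    problems = []

    # The aux-services value is a comma-separated list of service names.
    services = [s.strip() for s in props.get(AUX_SERVICES_KEY, "").split(",")]
    if REQUIRED_SERVICE not in services:
        problems.append(f"{REQUIRED_SERVICE} is not listed in {AUX_SERVICES_KEY}")

    class_key = f"{AUX_SERVICES_KEY}.{REQUIRED_SERVICE}.class"
    if class_key not in props:
        problems.append(f"{class_key} is not set in this file")

    return problems


if __name__ == "__main__" and os.path.exists(CONF_PATH):
    issues = missing_shuffle_settings(ET.parse(CONF_PATH).getroot())
    for issue in issues:
        print(issue)
    if not issues:
        print("mapreduce_shuffle settings found")
```

Run it on a node of the cluster; if it prints nothing but the final message, the properties above are already present.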

If mapreduce_shuffle is missing, add the property and restart the YARN NodeManager service:

sudo stop hadoop-yarn-nodemanager
sudo start hadoop-yarn-nodemanager

I would also suggest using the S3DistCp utility (s3-dist-cp), since it is already available on EMR clusters:

s3-dist-cp --src s3://my-tables/incoming/hourly_table --dest /data/hdfslocation/path

https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/