I am trying to set up a pseudo-distributed Hadoop 2.6 cluster to run Giraph jobs. Since I could not find a comprehensive guide for this, I have been relying on the Giraph Quick Start (http://giraph.apache.org/quick_start.html), which unfortunately targets Hadoop 0.20.203.0, plus a few Hadoop 2.6 / YARN tutorials. To do things properly, I put together a bash script that should install both Hadoop and Giraph. Unfortunately, the Giraph jobs keep failing with a 'Worker failed during input split' exception. I would be very grateful if someone could point out a mistake in my deployment process or suggest another working approach.
Edit: My main goal is to be able to develop Giraph 1.1 jobs. I don't need to run any heavy computations myself (eventually the jobs will run on an external cluster), so if there is a simpler way to set up a Giraph development environment, that would work too.
The installation script follows:
#! /bin/bash
set -exu
echo "Starting hadoop + giraph installation; JAVA HOME is $JAVA_HOME"
INSTALL_DIR=~/apache_hadoop_giraph
mkdir -p $INSTALL_DIR/downloads
############# PHASE 1: YARN ##############
#### 1: Get and unpack Hadoop:
if [ ! -f $INSTALL_DIR/downloads/hadoop-2.6.0.tar.gz ]; then
  wget -P $INSTALL_DIR/downloads ftp://ftp.task.gda.pl/pub/www/apache/dist/hadoop/core/hadoop-2.6.0/hadoop-2.6.0.tar.gz
fi
tar -xf $INSTALL_DIR/downloads/hadoop-2.6.0.tar.gz -C $INSTALL_DIR
export HADOOP_PREFIX=$INSTALL_DIR/hadoop-2.6.0
export HADOOP_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
#### 2: Configure Hadoop and YARN
sed -i -e "s|^export JAVA_HOME=\${JAVA_HOME}|export JAVA_HOME=$JAVA_HOME|g" ${HADOOP_PREFIX}/etc/hadoop/hadoop-env.sh
cat <<EOF > ${HADOOP_PREFIX}/etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
cat <<EOF > ${HADOOP_PREFIX}/etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
cat <<EOF > ${HADOOP_PREFIX}/etc/hadoop/mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF
cat <<EOF > ${HADOOP_PREFIX}/etc/hadoop/yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF
#### 3: Prepare HDFS:
cd $HADOOP_PREFIX
export HDFS=$HADOOP_PREFIX/bin/hdfs
sbin/stop-all.sh # Just to be sure we have no running daemons
# The following line is commented out in case some SO readers have something important in /tmp:
# rm -rf /tmp/* || echo "removal of some parts of tmp failed"
$HDFS namenode -format
sbin/start-dfs.sh
#### 4: Create HDFS directories:
$HDFS dfs -mkdir -p /user
$HDFS dfs -mkdir -p /user/`whoami`
#### 5 (optional): Run a test job
sbin/start-yarn.sh
$HDFS dfs -put etc/hadoop input
bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output 'dfs[a-z.]+'
$HDFS dfs -cat output/* # Prints some stuff grep'd out of input file
sbin/stop-yarn.sh
#### 6: Stop HDFS for now
sbin/stop-dfs.sh
############# PHASE 2: Giraph ##############
#### 1: Get Giraph 1.1
export GIRAPH_HOME=$INSTALL_DIR/giraph
cd $INSTALL_DIR
git clone http://git-wip-us.apache.org/repos/asf/giraph.git giraph
cd $GIRAPH_HOME
git checkout release-1.1
#### 2: Build
mvn -Phadoop_2 -Dhadoop.version=2.6.0 -DskipTests package
#### 3: Run a test job:
# Remove leftovers if any:
$HADOOP_HOME/sbin/start-dfs.sh
$HDFS dfs -rm -r -f /user/`whoami`/output
$HDFS dfs -rm -r -f /user/`whoami`/input/tiny_graph.txt
$HDFS dfs -mkdir -p /user/`whoami`/input
# Place input:
$HDFS dfs -put tiny_graph.txt input/tiny_graph.txt
# Start YARN
$HADOOP_HOME/sbin/start-yarn.sh
# Run the job (this fails with 'Worker failed during input split'):
JAR=$GIRAPH_HOME/giraph-examples/target/giraph-examples-1.1.0-for-hadoop-2.6.0-jar-with-dependencies.jar
CORE=$GIRAPH_HOME/giraph-core/target/giraph-1.1.0-for-hadoop-2.6.0-jar-with-dependencies.jar
$HADOOP_HOME/bin/hadoop jar $JAR \
  org.apache.giraph.GiraphRunner \
  org.apache.giraph.examples.SimpleShortestPathsComputation \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip /user/ptaku/input/tiny_graph.txt \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /user/ptaku/output/shortestpaths \
  -yj $JAR,$CORE \
  -w 1 \
  -ca giraph.SplitMasterWorker=false
The script runs smoothly up to the last command, which hangs for a long time at map 100% reduce 0%; digging into the YARN container log files reveals a cryptic java.lang.IllegalStateException: coordinateVertexInputSplits: Worker failed during input split (currently not supported). The full container logs are on pastebin:
Container 1 (master): http://pastebin.com/6nYvtNxJ
Container 2 (worker): http://pastebin.com/3a6CQamQ
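For completeness, this is roughly how the container logs can be pulled from YARN (a sketch; the application id below is a placeholder, and since my yarn-site.xml leaves log aggregation disabled, the same files also sit on the local disk under $HADOOP_PREFIX/logs/userlogs):
# List applications and note the id of the failed Giraph job:
$HADOOP_HOME/bin/yarn application -list -appStates ALL
# Dump all container logs for that application (placeholder id):
$HADOOP_HOME/bin/yarn logs -applicationId application_1423000000000_0001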
I also tried building Giraph with the hadoop_yarn profile (after removing STATIC_SASL_SYMBOL from the pom.xml), but it did not change anything.
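The alternative build was essentially the following (a sketch of what I ran; the hadoop_yarn profile and the hadoop.version property are the ones defined in Giraph's top-level pom.xml):
# Build against the pure-YARN profile instead of hadoop_2 (did not help in my case):
mvn -Phadoop_yarn -Dhadoop.version=2.6.0 -DskipTests clean package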
I am running Ubuntu 14.10 64-bit with 4 GB of RAM and 16 GB of swap. Additional system information:
>> uname -a
Linux Graffi 3.13.0-35-generic #62-Ubuntu SMP Fri Aug 15 01:58:42 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>> which java
/usr/bin/java
>> java -version
java version "1.7.0_75"
OpenJDK Runtime Environment (IcedTea 2.5.4) (7u75-2.5.4-1~trusty1)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)
>> echo $JAVA_HOME
/usr/lib/jvm/java-7-openjdk-amd64/jre
>> which mvn
/usr/bin/mvn
>> mvn --version
Apache Maven 3.0.5
Maven home: /usr/share/maven
Java version: 1.7.0_75, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-7-openjdk-amd64/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.13.0-35-generic", arch: "amd64", family: "unix"
Any hints on how to get Giraph 1.1 running on Hadoop 2.6 would be greatly appreciated.
Answer 0 (score: 1)
I ran into a similar problem a while ago. The issue was that my machine had uppercase letters in its hostname, which is a known bug (https://issues.apache.org/jira/browse/GIRAPH-904). Changing the hostname to all lowercase letters fixed it for me.
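In case it helps, here is a rough sketch of checking and switching to an all-lowercase hostname on Ubuntu (the name below is only an example; adjust /etc/hosts to match and restart the Hadoop daemons afterwards):
# Print the current hostname; uppercase letters in it can trigger GIRAPH-904:
hostname
# Switch to an all-lowercase name (example name, pick your own):
sudo hostname myhost
echo myhost | sudo tee /etc/hostname
# Make sure /etc/hosts maps 127.0.1.1 to the same lowercase name,
# then restart HDFS and YARN so the workers register under the new hostname.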