Unable to run embedded Pig in MapReduce mode from Java

Date: 2014-02-20 02:28:20

Tags: hadoop configuration mapreduce apache-pig

I am using Pig 0.12.0 with Hadoop 2.2.0. I have successfully run Pig from the grunt shell and from Pig batch scripts, in both local and MapReduce mode. Now I am trying to run Pig embedded in Java.

That said, I have also successfully run embedded Pig in local mode. The trouble starts when I try to run embedded Pig in MapReduce mode.

The problem: the class compiles successfully, but nothing at all happens when I run

    java -cp <classpath> PigMapRedMode

I later read that I should include a pig.properties on the classpath, e.g.

    fs.default.name=hdfs://<namenode-hostname>:<port>
    mapred.job.tracker=<jobtracker-hostname>:<port>

However, the JobTracker no longer exists in Hadoop 2.2.0. Any ideas what to do?
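Presumably the YARN-era equivalents would be something along these lines (fs.defaultFS in place of fs.default.name, and the ResourceManager in place of the JobTracker), though I don't know whether Pig 0.12 expects these exact keys:

    fs.defaultFS=hdfs://<namenode-hostname>:<port>
    yarn.resourcemanager.address=<resourcemanager-hostname>:<port>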

I have attached the Java code for PigMapRedMode below, in case the problem is there.

    import java.io.IOException;

    import org.apache.pig.PigServer;

    public class PigMapRedMode {
        public static void main(String[] arg) {
            try {
                // "mapreduce" exec type; still need to point this at the cluster
                // configuration somehow (the original placeholder string noted this).
                PigServer pigServer = new PigServer("mapreduce");
                runIdQuery(pigServer, "5pts.txt");
            } catch (Exception e) {
                // Without this, every failure is silently swallowed,
                // which is exactly why "nothing happens".
                e.printStackTrace();
            }
        }

        public static void runIdQuery(PigServer pigServer, String inputFile) throws IOException {
            pigServer.registerQuery("A = load '" + inputFile + "' using PigStorage(',');");
            pigServer.registerQuery("B = foreach A generate $0 as id;");
            pigServer.store("B", "id.out");
        }
    }

Update

Solution found! It turns out you do not need to supply a Properties object or put a pig.properties on the classpath. All you have to do is include the Hadoop configuration directory on the classpath (for my Hadoop 2.2.0 install that is /etc/hadoop); fs.default.name and yarn.resourcemanager.address are then picked up from that location.
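(For context: with that directory on the classpath, Hadoop's Configuration appears to load core-site.xml and yarn-site.xml from it automatically. A minimal core-site.xml sketch, with a placeholder address rather than my real one:)

    <configuration>
        <property>
            <name>fs.default.name</name>
            <value>hdfs://<namenode-hostname>:<port></value>
        </property>
    </configuration>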

The modified Java code is below:

    /**
     * Created by allenlin on 2/19/14.
     */
    import java.io.IOException;

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigMapRedMode {
        public static void main(String[] arg) {
            try {
                // With $HADOOP_CONF_DIR on the classpath, no Properties object
                // is needed; the cluster addresses come from the config files.
                PigServer pigServer = new PigServer(ExecType.MAPREDUCE);
                runIdQuery(pigServer, "<hdfs input address>");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public static void runIdQuery(PigServer pigServer, String inputFile) throws IOException {
            pigServer.registerQuery("A = load '" + inputFile + "' using PigStorage(',');");
            pigServer.registerQuery("B = foreach A generate $0 as id;");
            pigServer.store("B", "<hdfs output address>");
        }
    }

Here is the Unix command I used to run the Java class. Note the dependencies you need to include:

java -cp ".:$PIG_HOME/build/pig-0.12.1-SNAPSHOT.jar:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/mapreduce/*:antlr-runtime-3.4.jar:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/hdfs/*:$PIG_HOME/build/ivy/lib/Pig/*:$HADOOP_CONF_DIR" PigMapRedMode

Thanks to @zsxwing for the help!

1 Answer:

Answer 0 (score: 0)

Here is how I run embedded Pig:

    import java.util.Properties;

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class test1 {
        public static void main(String[] args) {
            try {
                // Build the cluster configuration first and pass it to PigServer;
                // properties set after construction would never be applied.
                Properties props = new Properties();
                props.setProperty("fs.default.name", "hdfs://localhost:9000");
                PigServer pigServer = new PigServer(ExecType.MAPREDUCE, props);
                runQuery(pigServer);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public static void runQuery(PigServer pigServer) {
            try {
                pigServer.registerQuery("input1 = LOAD '/input.data' as (line:chararray);");
                pigServer.registerQuery("words = foreach input1 generate FLATTEN(TOKENIZE(line)) as word;");
                pigServer.registerQuery("word_groups = group words by word;");
                pigServer.registerQuery("word_count = foreach word_groups generate group, COUNT(words);");
                pigServer.registerQuery("ordered_word_count = order word_count by group desc;");
                pigServer.registerQuery("store ordered_word_count into '/wct';");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
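Note that the Properties object has to be built before the PigServer and handed to the PigServer(ExecType, Properties) constructor as above; setting values on a standalone Properties object afterwards has no effect.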

Setting HADOOP_HOME in Eclipse:

    Run Configurations-->ClassPath-->User Entries-->Advanced-->Add ClassPath Variables-->New-->Name(HADOOP_HOME)-->Path(Your Hadoop directory path)

And I added these Maven dependencies:

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.1</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.4</version>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.16</version>
        </dependency>
        <dependency>
            <groupId>org.apache.pig</groupId>
            <artifactId>pig</artifactId>
            <version>0.15.0</version>
        </dependency>
        <dependency>
            <groupId>org.antlr</groupId>
            <artifactId>antlr-runtime</artifactId>
            <version>3.4</version>
        </dependency>
    </dependencies>
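One caveat: as far as I know the plain pig artifact is built against Hadoop 1; for a Hadoop 2 cluster you may need the Hadoop 2 build, which (if I remember right) is published under the h2 classifier:

    <dependency>
        <groupId>org.apache.pig</groupId>
        <artifactId>pig</artifactId>
        <version>0.15.0</version>
        <classifier>h2</classifier>
    </dependency>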

If HADOOP_HOME is not set correctly, you get the following error:

    hadoop20.PigJobControl: falling back to default JobControl (not using hadoop 0.20 ?)