Developing, testing and debugging Hadoop map/reduce jobs with Eclipse

Date: 2012-06-13 02:21:54

Tags: eclipse debugging maven hadoop mapreduce

What are my options for developing Java Map Reduce jobs in Eclipse? My ultimate goal is to run the map/reduce logic I develop on my Amazon Hadoop cluster, but first I'd like to test the logic on my local machine and be able to drop in breakpoints before deploying it to the larger cluster.

I see there is a Hadoop plugin for Eclipse that looks rather old (correct me if I'm wrong), and a company called Karmasphere used to have Eclipse-and-Hadoop tooling, but I'm not sure it's still available.

How do you develop, test and debug map/reduce jobs using Eclipse?

3 Answers:

Answer 0 (score: 4)

I develop Cassandra/Hadoop applications in Eclipse by:

  1. Using maven (m2e) to gather and configure the dependencies (Hadoop, Cassandra, Pig, etc.) for my Eclipse projects.

  2. Creating test cases (classes in src/test/java) that exercise my mappers and reducers. The trick is to build a context object on the fly using inner classes that extend RecordWriter and StatusReporter. Once you've done that, you can invoke setup/map/cleanup or setup/reduce/cleanup and then assert that the correct key/value pairs and context information were written by the mapper or reducer. The context constructors in both mapred and mapreduce look ugly, but you'll find the classes are quite easy to instantiate (a sketch of one way to build such a context follows the sample test class below).

  3. Once the tests are written, maven invokes them automatically every time you build.

  4. You can invoke the tests manually by selecting the project and doing Run -> Maven test. This turns out to be really handy, because the tests run in debug mode, so you can set breakpoints in your mappers and reducers and do everything Eclipse lets you do while debugging.

  5. Once you're happy with the quality of the code, use maven to build a jar-with-dependencies, a single jar with everything in it, which hadoop is perfectly happy to run.

  6. As a side note, I've built a number of code-generation tools in Eclipse based on the M2T JET project. They generate the infrastructure for everything I mentioned above, and I just write the logic for my mappers, reducers and test cases. I think that with a bit of thought you could come up with a set of reusable classes that you extend to do much the same thing.

    Here is a sample test case class:

    /*
     * 
     * This source code and information are provided "AS-IS" without 
     * warranty of any kind, either expressed or implied, including
     * but not limited to the implied warranties of merchantability
     * and/or fitness for a particular purpose.
     * 
     * This source code was generated using an evaluation copy 
     * of the Cassandra/Hadoop Accelerator and may not be used for
     * production purposes.
     *
     */
    package com.creditco.countwords.ReadDocs;
    
    // Begin imports 
    
    import java.io.IOException;
    import java.util.ArrayList;
    
    import junit.framework.TestCase;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Counter;
    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.OutputCommitter;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.RecordWriter;
    import org.apache.hadoop.mapreduce.StatusReporter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.TaskAttemptID;
    import org.junit.Test;
    
    // End imports 
    
    public class ParseDocsMapperTest extends TestCase {
    
        @Test
        public void testCount() {
    
            TestRecordWriter    recordWriter    = new TestRecordWriter();
            TestRecordReader    recordReader    = new TestRecordReader();
            TestOutputCommitter outputCommitter = new TestOutputCommitter();
            TestStatusReporter  statusReporter  = new TestStatusReporter();
            TestInputSplit      inputSplit      = new TestInputSplit();
    
            try {
    
                    // Begin test logic
    
    
                    // Get an instance of the mapper to be tested and a context instance
                ParseDocsMapper mapper = new ParseDocsMapper();
    
                Mapper<LongWritable,Text,Text,IntWritable>.Context context = 
                    mapper.testContext(new Configuration(), new TaskAttemptID(),recordReader,recordWriter,outputCommitter,statusReporter,inputSplit);
    
                    // Invoke the setup, map and cleanup methods
                mapper.setup(context);
    
                LongWritable key = new LongWritable(30);
                Text value = new Text("abc def ghi");
    
                mapper.map(key, value, context);
    
                if (recordWriter.getKeys().length != 3) {
                    fail("com.creditco.countwords:ParseDocsMapperTest.testCount() - Wrong number of records written ");
                }
                mapper.cleanup(context);
    
                    // Validation:
                    //
                    // recordWriter.getKeys() returns the keys written to the context by the mapper
                    // recordWriter.getValues() returns the values written to the context by the mapper
                    // statusReporter returns the most recent status and any counters set by the mapper
                    //
    
                    // End test logic
    
            } catch (Exception e) {
                fail("com.creditco.countwords:ParseDocsMapperTest.testCount() - Exception thrown: "+e.getMessage());
            }
    
        }
    
        final class TestRecordWriter extends RecordWriter<Text, IntWritable> {
            ArrayList<Text> keys = new ArrayList<Text>();
            ArrayList<IntWritable> values = new ArrayList<IntWritable>();
            public void close(TaskAttemptContext arg0) throws IOException, InterruptedException { }
            public void write(Text key, IntWritable value) throws IOException, InterruptedException {
                keys.add(key);
                values.add(value);
            }
            public Text[] getKeys() {
                Text result[] = new Text[keys.size()];
                keys.toArray(result);
                return result;
            }
            public IntWritable[] getValues() {
                IntWritable[] result = new IntWritable[values.size()];
                values.toArray(result);
                return result;
            }
        };  
    
        final class TestRecordReader extends RecordReader<LongWritable, Text> {
            public void close() throws IOException { }
            public LongWritable getCurrentKey() throws IOException, InterruptedException {
                throw new RuntimeException("Tried to call RecordReader:getCurrentKey()");
            }
            public Text getCurrentValue() throws IOException, InterruptedException {
                throw new RuntimeException("Tried to call RecordReader:getCurrentValue()");
            }
            public float getProgress() throws IOException, InterruptedException {
                throw new RuntimeException("Tried to call RecordReader:getProgress()");
            }
            public void initialize(InputSplit arg0, TaskAttemptContext arg1) throws IOException, InterruptedException { }
            public boolean nextKeyValue() throws IOException, InterruptedException {
                return false;
            }
        };
    
        final class TestStatusReporter extends StatusReporter {
            private Counters counters = new Counters();
            private String status = null;
            public void setStatus(String arg0) {
                status = arg0;
            }
            public String getStatus() {
                return status;
            }
            public void progress() { }
            public Counter getCounter(String arg0, String arg1) {
                return counters.getGroup(arg0).findCounter(arg1);
            }
            public Counter getCounter(Enum<?> arg0) {
                return null;
            }
        };
    
        final class TestInputSplit extends InputSplit {
            public String[] getLocations() throws IOException, InterruptedException {
                return null;
            }
            public long getLength() throws IOException, InterruptedException {
                return 0;
            }
        };
    
        final class TestOutputCommitter extends OutputCommitter {
            public void setupTask(TaskAttemptContext arg0) throws IOException { }
            public void setupJob(JobContext arg0) throws IOException { }
            public boolean needsTaskCommit(TaskAttemptContext arg0) throws IOException {
                return false;
            }
            public void commitTask(TaskAttemptContext arg0) throws IOException { }
            public void cleanupJob(JobContext arg0) throws IOException { }
            public void abortTask(TaskAttemptContext arg0) throws IOException { }
        };
    
    }
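
    The mapper.testContext(...) call in the test above comes from generated code that isn't shown. Purely as an illustration of the trick described in step 2, here is a guess at what such a helper might look like against the hadoop-core 0.20.x API used in the pom below, where Mapper.Context is a concrete inner class with a public constructor (under the newer MRv2 API, Context is abstract and this approach no longer compiles). The word-count style map logic is just a placeholder; the real ParseDocsMapper is not part of the original answer:

    package com.creditco.countwords.ReadDocs;

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.OutputCommitter;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.RecordWriter;
    import org.apache.hadoop.mapreduce.StatusReporter;
    import org.apache.hadoop.mapreduce.TaskAttemptID;

    public class ParseDocsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Placeholder logic: emit (token, 1) for every whitespace-separated token.
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                context.write(word, ONE);
            }
        }

        // Test-only helper: assembles a Context from the fake collaborators created in
        // the unit test. Context is an inner class of Mapper, so it can only be built
        // from inside a Mapper instance, which is why the test calls mapper.testContext(...).
        public Context testContext(Configuration conf, TaskAttemptID taskId,
                RecordReader<LongWritable, Text> reader,
                RecordWriter<Text, IntWritable> writer,
                OutputCommitter committer,
                StatusReporter reporter,
                InputSplit split) throws IOException, InterruptedException {
            return new Context(conf, taskId, reader, writer, committer, reporter, split);
        }
    }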
    

    Here is a sample maven pom. Note that the versions referenced are a little out of date, but as long as those versions are still hosted in a maven repository somewhere, you will be able to build this project.

    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
      <groupId>com.creditco</groupId>
      <artifactId>wordcount.example</artifactId>
      <version>0.0.1-SNAPSHOT</version>
        <build>
            <plugins>
                <plugin>
                    <artifactId>maven-assembly-plugin</artifactId>
                    <version>2.2</version>
                    <configuration>
                        <descriptorRefs>
                            <descriptorRef>jar-with-dependencies</descriptorRef>
                        </descriptorRefs>
                    </configuration>
                </plugin>
            </plugins>
        </build>
      <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
            <version>0.20.2</version>
            <type>jar</type>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.cassandra</groupId>
            <artifactId>cassandra-all</artifactId>
            <version>1.0.6</version>
            <type>jar</type>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.cassandraunit</groupId>
            <artifactId>cassandra-unit</artifactId>
            <version>1.0.1.1</version>
            <type>jar</type>
            <scope>compile</scope>
            <exclusions>
                <exclusion>
                    <artifactId>hamcrest-all</artifactId>
                    <groupId>org.hamcrest</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.pig</groupId>
            <artifactId>pig</artifactId>
            <version>0.9.1</version>
            <type>jar</type>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.json</groupId>
            <artifactId>json</artifactId>
            <version>20090211</version>
            <type>jar</type>
            <scope>compile</scope>
        </dependency>
      </dependencies>
    </project>
    

Answer 1 (score: 0)

I use the MiniMRCluster that ships with Apache Hadoop. You can start a mini Map Reduce cluster right inside a unit test! HBase also has HBaseTestingUtility, which is nice because it can start up HDFS and MapReduce in roughly two lines.
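
A rough sketch of what such a test could look like against the 0.20.x hadoop-test artifact (MiniDFSCluster and MiniMRCluster; class names and constructor signatures may differ in newer releases):

    import junit.framework.TestCase;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.MiniDFSCluster;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MiniMRCluster;

    public class MiniClusterTest extends TestCase {

        public void testJobOnMiniCluster() throws Exception {
            Configuration conf = new Configuration();

            // In-process HDFS with one datanode, plus an in-process MapReduce
            // cluster with one tasktracker pointed at it.
            MiniDFSCluster dfs = new MiniDFSCluster(conf, 1, true, null);
            MiniMRCluster mr = new MiniMRCluster(1, dfs.getFileSystem().getUri().toString(), 1);
            try {
                FileSystem fs = dfs.getFileSystem();
                fs.mkdirs(new Path("/input"));          // stage test input here

                JobConf jobConf = mr.createJobConf();   // already wired to the mini cluster
                // ... set mapper/reducer/input/output paths on jobConf, submit the job
                // (e.g. with JobClient.runJob), then assert on the output files.
            } finally {
                mr.shutdown();
                dfs.shutdown();
            }
        }
    }

HBaseTestingUtility wraps the same idea for HBase: startMiniCluster() plus startMiniMapReduceCluster() brings up HDFS, HBase and MapReduce with very little code.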

Answer 2 (score: 0)

@Chris Gerken - I tried running the Word Count job in Eclipse by running the Driver as a Java application, but I get a ClassNotFoundException on the Mapper. It seems to me that when it is run as a plain Java application, the hadoop job doesn't get the jar containing the required Mapper and Reducer classes.
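
A likely cause is that, when the driver is launched straight from the Eclipse classpath, no job jar is submitted with the job, so the task JVMs have nothing to load the Mapper and Reducer from. One way around it, sketched here against the mapreduce API and the pom above (WordCountDriver is a hypothetical class name), is to point the job at a pre-built jar, or at a class that really does live inside one:

    package com.creditco.countwords.ReadDocs;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Option 1: name the job jar explicitly, e.g. the jar-with-dependencies
            // produced by the assembly plugin in the pom above.
            conf.set("mapred.jar", "target/wordcount.example-0.0.1-SNAPSHOT-jar-with-dependencies.jar");

            Job job = new Job(conf, "wordcount");
            // Option 2: when the driver class genuinely lives inside a jar on the
            // classpath, setJarByClass alone is enough for the tasks to find it.
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(ParseDocsMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }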