Question

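I wrote a MapReduce (MR) algorithm that builds a data structure from some data. Once it is built, I need to answer queries against it. To answer those queries faster, I generated some metadata (a few MB in size) from the result.

Now my question is:

Is it possible to keep this metadata in the master node's memory, so that queries avoid file I/O and are answered faster?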

Answer 1

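Assuming, based on the OP's responses to the other answers, that the metadata will be needed by another MR job, using the distributed cache is fairly straightforward:

In the driver class: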

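import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class DriverClass extends Configured {

  public static void main(String[] args) throws Exception {

    /* ...some init code... */

    /*
     * Register the metadata file with the distributed cache first, then
     * instantiate a Job object from that configuration. The Job takes its
     * own copy of the Configuration, so the cache file must be added
     * before the Job is created.
     */
    Configuration job_conf = new Configuration();
    DistributedCache.addCacheFile(new Path("path/to/your/data.txt").toUri(), job_conf);
    Job job = new Job(job_conf); // deprecated in Hadoop 2.x in favor of Job.getInstance(job_conf)

    /* ... configure and start the job... */

  }
}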

In the mapper class, you can read the data during the setup phase and make it available to the map method:

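import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import com.google.common.io.Files; // Guava
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class YourMapper extends Mapper<LongWritable, Text, Text, Text> {

  private List<String> lines = new ArrayList<String>();

  @Override
  protected void setup(Context context) throws IOException,
      InterruptedException {

    /* Get the local copies of the cached archives/files */
    Path[] cached_file = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    if (cached_file == null || cached_file.length == 0) {
      // Fail fast instead of hitting an index error on the next line
      throw new IOException("No file found in the distributed cache");
    }
    File f = new File(cached_file[0].toString());

    /*
     * Read the data, something like this. UTF-8 is an assumption here;
     * use whatever charset the metadata file was written with.
     */
    lines = Files.readLines(f, StandardCharsets.UTF_8);
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

      /*
      * In the mapper - use the data as needed
      */

  }
}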
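
Since the metadata is only a few MB, the lines loaded in setup can be turned into an in-memory lookup once, so that map() does not scan the list for every record. The following is only a minimal sketch of that idea, not part of the original answer: the class name MetadataIndex, the method buildIndex, and the tab-separated "key<TAB>value" line format are all assumptions made for illustration, and the real metadata format may well differ.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: indexes cached metadata lines for fast lookup. */
final class MetadataIndex {

  /**
   * Builds a key -> value map from lines of the assumed form "key<TAB>value".
   * Lines without a tab are skipped rather than treated as errors.
   */
  static Map<String, String> buildIndex(List<String> lines) {
    Map<String, String> index = new HashMap<>();
    for (String line : lines) {
      int tab = line.indexOf('\t');
      if (tab < 0) {
        continue; // malformed line, ignore it
      }
      index.put(line.substring(0, tab), line.substring(tab + 1));
    }
    return index;
  }

  public static void main(String[] args) {
    // Stand-in for the list filled by Files.readLines(...) in setup()
    List<String> lines = Arrays.asList("user42\tgold", "user7\tsilver", "no-separator");
    Map<String, String> index = buildIndex(lines);
    System.out.println(index.get("user42")); // prints "gold"
    System.out.println(index.size());        // prints 2
  }
}
```

In the mapper, such an index would be built once at the end of setup and kept in a field, so each map() call costs a single hash lookup.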

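Note that the distributed cache can hold more than plain text files. You can also use archives (zip, tar, ...) and even complete Java classes (jar files).

Also note that in newer Hadoop implementations the distributed cache API is found in the Job class itself (for example Job.addCacheFile in the driver, with context.getCacheFiles() on the task side). See this API and this answer.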

Maintaining a data structure in the master node
