Question

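I am writing a map-reduce program to query a Cassandra column family. I only need to read a subset of rows from one column family, selected by row key. I already have the set of row keys I need to read. How can I pass this "row key set" to the map-reduce job so that it outputs only those rows from the Cassandra column family?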

Summary:

  class GetRows
  {
    public Set<String> getRowKeys()
    {
      Set<String> rowKeys = new HashSet<>();
      // logic that collects the row keys ...
      return rowKeys;
    }
  }


  class MapReduceCassandra
  {
    // input format: ColumnFamilyInputFormat
    // ...
    // also needs the set of input row keys. How do I get it here?
  }

Can anyone suggest the best way to invoke map-reduce from a Java application, and how to pass a set of keys to the map-reduce job?

Answer 1

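Invoking map-reduce from Java

To do this, you can use the classes in the org.apache.hadoop.mapreduce namespace from within your Java application (a very similar approach works with the older mapred API; just check the API docs):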

Job job = Job.getInstance(new Configuration());
// configure job: set input and output types and directories, etc.

job.setJarByClass(MapReduceCassandra.class);
job.submit();

Passing data to the map-reduce job

If your set of row keys is really small, you can serialize it to a string and pass it as a configuration parameter:

job.getConfiguration().set("CassandraRows", getRowsKeysSerialized()); // TODO: implement serializer

//...

job.submit();

Inside the job, you can access the parameter through the context object:

public void map(
    IntWritable key,  // your key type
    Text value,       // your value type
    Context context
)
{
    // ...

    String rowsSerialized = context.getConfiguration().get("CassandraRows");
    String[] rows = deserializeRows(rowsSerialized);  // TODO: implement deserializer

    //...
}
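The serializer and deserializer are left as TODOs above. As a minimal sketch of one way to fill them in: the class name `RowKeyCodec` and its methods are hypothetical, and the encoding (Base64 per key, joined with commas) is an assumption chosen so that keys containing commas survive the round trip. It returns a `Set<String>` rather than the `String[]` used above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.StringJoiner;

// Hypothetical helper, not part of Hadoop or Cassandra.
final class RowKeyCodec {
    private RowKeyCodec() {}

    // Encode each key as Base64 and join with commas.
    // The Base64 alphabet contains no comma, so the separator is unambiguous.
    static String serialize(Set<String> rowKeys) {
        StringJoiner joiner = new StringJoiner(",");
        for (String key : rowKeys) {
            byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
            joiner.add(Base64.getEncoder().encodeToString(bytes));
        }
        return joiner.toString();
    }

    // Reverse of serialize(); a null or empty string yields an empty set.
    // Assumes row keys are never empty strings.
    static Set<String> deserialize(String serialized) {
        Set<String> rowKeys = new LinkedHashSet<>();
        if (serialized == null || serialized.isEmpty()) {
            return rowKeys;
        }
        for (String token : serialized.split(",")) {
            byte[] bytes = Base64.getDecoder().decode(token);
            rowKeys.add(new String(bytes, StandardCharsets.UTF_8));
        }
        return rowKeys;
    }
}
```

With this in place, the driver could call `job.getConfiguration().set("CassandraRows", RowKeyCodec.serialize(rowKeys))`, and the mapper could call `RowKeyCodec.deserialize(...)` on the value it reads back from the configuration.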

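However, if your set is potentially unbounded, passing it as a parameter would be a bad idea. Instead, you should pass the keys in a file and take advantage of the distributed cache. You can then add this line to the part above, before submitting the job: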

job.addCacheFile(new Path(pathToCassandraKeySetFile).toUri());

//...

job.submit();

Inside the job, you will be able to access this file through the context object:

public void map(
    IntWritable key,  // your key type
    Text value,       // your value type
    Context context
)
{
    // ...

    URI[] cacheFiles = context.getCacheFiles();

    // find, open and read your file here

    // ...
}
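The "find, open and read" step depends on how the key file is laid out, which is not specified here. A minimal sketch, assuming one row key per line: the class `RowKeyFile` and its `readKeys` method are hypothetical names, and the code deliberately takes a plain `java.io.Reader` so it does not depend on Hadoop.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Hadoop or Cassandra.
final class RowKeyFile {
    private RowKeyFile() {}

    // Reads one row key per line (an assumed file layout),
    // trimming whitespace and skipping blank lines.
    static Set<String> readKeys(Reader source) {
        Set<String> keys = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String key = line.trim();
                if (!key.isEmpty()) {
                    keys.add(key);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return keys;
    }
}
```

In a mapper this would typically be called once, for example from `setup(Context)`, with a reader wrapped around the stream opened for the cached file (one option is `FileSystem.get(context.getConfiguration()).open(new Path(cacheFiles[0]))` inside an `InputStreamReader`). The resulting set can then be consulted in `map` to skip rows whose key is not in it.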

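Note: all of this applies to the new API (org.apache.hadoop.mapreduce). If you are using org.apache.hadoop.mapred, the approach is very similar, but some of the relevant methods are called on different objects.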
