Use case input:

Songs file -

| Id | Song | Type |
|----|------|------|
| s1 | song1 | classical |
| s2 | song2 | jazz |
| s3 | song3 | classical |

User ratings file -

| User_Id | Song_Id | Rating |
|---------|---------|--------|
| U1 | S1 | 7 |
| U2 | S2 | 5 |
| U3 | S2 | 9 |
| U4 | S1 | 7 |
| U5 | S5 | 5 |
| U6 | S1 | 9 |

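Note: both files contain very large amounts of data.

Use case description:

Find the average rating of each classical song.

The solution I have come up with is to use two chained jobs:

1. Job1: gets the IDs of all classical songs and adds them to the distributed cache.

2. Job2: the mapper in the second job filters the ratings of classical songs based on the values in the cache. The reducer calculates the average rating of each song.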
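To make the intended data flow concrete, here is a minimal in-memory sketch of the two steps in plain Java, with no Hadoop involved. The class and method names are hypothetical, the third song's Id is assumed to be s3, and IDs are upper-cased on the assumption that s1 and S1 refer to the same song.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** In-memory walk-through of the two-job plan (hypothetical names, no Hadoop). */
class TwoJobPlanSketch {

    /** Stand-in for Job1: collect the IDs of all songs whose type is "classical". */
    static Set<String> classicalIds(List<String[]> songs) {
        Set<String> ids = new HashSet<>();
        for (String[] song : songs) {              // song = {id, name, type}
            if ("classical".equalsIgnoreCase(song[2])) {
                ids.add(song[0].toUpperCase());    // assumption: s1 and S1 are the same ID
            }
        }
        return ids;
    }

    /** Stand-in for Job2: filter ratings by ID (mapper), then average per song (reducer). */
    static Map<String, Double> averageRatings(List<String[]> ratings, Set<String> ids) {
        Map<String, int[]> sumAndCount = new TreeMap<>();
        for (String[] rating : ratings) {          // rating = {userId, songId, rating}
            String songId = rating[1].toUpperCase();
            if (!ids.contains(songId)) {
                continue;                          // not a classical song: drop the record
            }
            int[] acc = sumAndCount.computeIfAbsent(songId, k -> new int[2]);
            acc[0] += Integer.parseInt(rating[2]); // running sum
            acc[1]++;                              // running count
        }
        Map<String, Double> averages = new TreeMap<>();
        for (Map.Entry<String, int[]> e : sumAndCount.entrySet()) {
            averages.put(e.getKey(), (double) e.getValue()[0] / e.getValue()[1]);
        }
        return averages;
    }

    public static void main(String[] args) {
        List<String[]> songs = Arrays.asList(
                new String[] {"s1", "song1", "classical"},
                new String[] {"s2", "song2", "jazz"},
                new String[] {"s3", "song3", "classical"});
        List<String[]> ratings = Arrays.asList(
                new String[] {"U1", "S1", "7"}, new String[] {"U2", "S2", "5"},
                new String[] {"U3", "S2", "9"}, new String[] {"U4", "S1", "7"},
                new String[] {"U5", "S5", "5"}, new String[] {"U6", "S1", "9"});
        System.out.println(averageRatings(ratings, classicalIds(songs)));
    }
}
```

With the sample rows, only S1 survives the filter (ratings 7, 7 and 9, so an average of about 7.67); the other classical song has no ratings and therefore does not appear in the result.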

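I searched the web to find out whether a job's output can be written directly to the distributed cache, but could not find anything useful.

I found a similar question on stackoverflow: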

"How to directly send the output of a mapper-reducer to a another mapper-reducer without
 saving the output into the hdfs"

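The solution there was to use 'SequenceFileOutputFormat'.

However, in my case I want all the song IDs to be available to every mapper in the second job, so I don't think that solution will work for me.

The alternative I have in mind is to run the first job, which finds the IDs of the classical songs and writes them to a file, then create a new job and add that song ID output file to the second job's cache. Please advise.

Any help is much appreciated.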

Answer 1

If each record is smaller than 1 MB, you can write the intermediate results to MemCached.

Answer 2

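Follow your second approach.

The first job writes its output to the file system.
The second job passes the required files to all nodes using the Job API rather than the DistributedCache API, which has been deprecated.

Have a look at the Job class API for methods such as

addCacheFile(URI uri)
getCacheFiles()

etc.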

Answer 3

One approach is to load the output of the first job into the distributed cache and then launch the second job.

//CONFIGURATION
//needs: org.apache.hadoop.mapreduce.Job, org.apache.hadoop.fs.FileSystem,
//org.apache.hadoop.fs.FileStatus, org.apache.hadoop.fs.Path, org.apache.hadoop.fs.PathFilter

Job job = Job.getInstance(getConf(), "Reading from distributed cache and etc.");
job.setJarByClass(this.getClass());

////////////
FileSystem fs = FileSystem.get(getConf());

/*
 * if you have, for example, a map only job,
 * that "something" could be "part-"
 */
FileStatus[] fileList = fs.listStatus(new Path("PATH OF FIRST JOB OUTPUT"),
        new PathFilter() {
            @Override
            public boolean accept(Path path) {
                return path.getName().contains("SOMETHING");
            }
        });

//add every matching output file of the first job to the distributed cache
for (FileStatus file : fileList) {
    DistributedCache.addCacheFile(file.getPath().toUri(), job.getConfiguration());
}

//other parameters

Mapper:

//in mapper
//needs: java.io.BufferedReader, java.io.FileReader, java.io.IOException,
//java.util.HashSet, java.util.Set, org.apache.hadoop.fs.Path

//SOME STRUCT TO STORE VALUES READ (a Set here; an ArrayList, HashMap..... whatever).
//It is a field rather than a local variable so that readFile() and map() can reach it.
private final Set<String> store = new HashSet<>();

@Override
public void setup(Context context) throws IOException, InterruptedException {

    try {
        Path[] fileCached = DistributedCache.getLocalCacheFiles(context.getConfiguration());

        if (fileCached != null && fileCached.length > 0) {
            for (Path file : fileCached) {
                readFile(file);
            }
        }
    } catch (IOException ex) {
        System.err.println("Exception in mapper setup: " + ex.getMessage());
    }

}

private void readFile(Path filePath) {

    //try-with-resources closes the reader even if readLine() throws
    try (BufferedReader bufferedReader = new BufferedReader(new FileReader(filePath.toString()))) {
        String line;

        while ((line = bufferedReader.readLine()) != null) {

            //reading that file line by line and updating our struct store
            //....

        } //end while (cycling over lines in file)

    } catch (IOException ex) {
        System.err.println("Exception while reading file: " + ex.getMessage());
    }
} //end readFile method

Now, in the map phase, you have the files passed as input to the job, and the values you need stored in your struct store.

My answer is taken from How to use a MapReduce output in Distributed Cache.
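As an illustration of what the elided part of readFile() and the lookup in map() might look like, here is a small stand-alone sketch in plain Java. It assumes the cached file holds one song ID per line and that rating records are comma-separated (User_Id,Song_Id,Rating); neither format is stated in the question. The class and method names are hypothetical, and Path here is java.nio.file.Path, not Hadoop's Path.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Stand-alone sketch of the cache lookup (hypothetical names, no Hadoop). */
class CachedIdLookupSketch {

    /** What readFile() might do: one song ID per line, blank lines skipped. */
    static Set<String> loadIds(Path file) throws IOException {
        Set<String> store = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String id = line.trim();
                if (!id.isEmpty()) {
                    store.add(id);
                }
            }
        }
        return store;
    }

    /** What map() might check: keep a rating record only if its song ID is in the store. */
    static boolean keep(String ratingLine, Set<String> store) {
        String[] fields = ratingLine.split(",");   // assumption: User_Id,Song_Id,Rating
        return fields.length == 3 && store.contains(fields[1].trim());
    }

    public static void main(String[] args) throws IOException {
        // A temporary file plays the role of the first job's output.
        Path ids = Files.createTempFile("classical-ids", ".txt");
        Files.write(ids, Arrays.asList("S1", "", "S3"), StandardCharsets.UTF_8);

        Set<String> store = loadIds(ids);
        System.out.println(keep("U1,S1,7", store)); // S1 is listed in the file
        System.out.println(keep("U2,S2,5", store)); // S2 is not listed
        Files.delete(ids);
    }
}
```

A HashSet keeps the per-record lookup cheap, which matters because map() runs once for every rating record. This only works if the full list of classical song IDs fits in each mapper's memory.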

How to write the output of a MapReduce job directly to the distributed cache so that it can be passed to another job
