
Question

Say I have the following input:

(1,2)(2,1)(1,3)(3,2)(2,4)(4,1)

The expected output is as follows:

(1,(2,3,4)) -> (1,3) //second index is total friend #
(2,(1,3,4)) -> (2,3)
(3,(1,2))   -> (3,2)
(4,(1,2))   -> (4,2)

I know how to do this with a HashSet in Java, but I don't know how it fits the MapReduce model. Can anyone share ideas or sample code for this problem? I would appreciate it.

-------------------------------------------- ----------------------------------------

Here is my naive solution: 1 mapper, 2 reducers. The mapper takes the input (1,2), (2,1), (1,3)

and organizes the output as

(1,hashset<2>), (2,hashset<1>), (1,hashset<2>), (2,hashset<1>), (1,hashset<3>), (3,hashset<1>)

Reducer1:

takes the mapper's output as input and outputs:

(1,hashset<2,3>), (3,hashset<1>) and (2,hashset<1>)

Reducer2:

takes Reducer1's output as input and outputs:

(1,2), (3,1) and (2,1)

This is only my naive solution. I'm not sure whether it can be done with Hadoop code.

Answer 1

I think there should be a simple way to solve this problem.

Mapper Input: (1,2)(2,1)(1,3)(3,2)(2,4)(4,1)

Emit two records for each pair, as follows:

Mapper Output/ Reducer Input:

Key => Value
1 => 2
2 => 1
2 => 1
1 => 2
1 => 3
3 => 1
3 => 2
2 => 3
2 => 4
4 => 2
4 => 1
1 => 4

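On the reducer side you will get 4 distinct groups (repeated values such as the second 1 => 2 have to be dropped, for example with a set):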

Reducer Output:

Key => Values
1 => [2,3,4]
2 => [1,3,4]
3 => [1,2]
4 => [1,2]

Now you can format the result however you need. :) Let me know if anyone sees a problem with this approach.
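To make the flow concrete, here is a minimal plain-Java sketch that imitates the two steps without Hadoop. The class and method names (FriendGroupingSketch, mapPhase, reducePhase) are made up for illustration, and the in-memory TreeMap only stands in for the shuffle that the framework would do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class, not part of any Hadoop API.
class FriendGroupingSketch {
    // "Map" step: emit each pair in both directions as {key, value}.
    static List<int[]> mapPhase(int[][] pairs) {
        List<int[]> emitted = new ArrayList<>();
        for (int[] p : pairs) {
            emitted.add(new int[] {p[0], p[1]});
            emitted.add(new int[] {p[1], p[0]});
        }
        return emitted;
    }

    // "Shuffle + reduce" step: group values by key; the set drops duplicates.
    static Map<Integer, TreeSet<Integer>> reducePhase(List<int[]> emitted) {
        Map<Integer, TreeSet<Integer>> groups = new TreeMap<>();
        for (int[] kv : emitted) {
            groups.computeIfAbsent(kv[0], k -> new TreeSet<>()).add(kv[1]);
        }
        return groups;
    }

    public static void main(String[] args) {
        int[][] input = {{1, 2}, {2, 1}, {1, 3}, {3, 2}, {2, 4}, {4, 1}};
        Map<Integer, TreeSet<Integer>> groups = reducePhase(mapPhase(input));
        for (Map.Entry<Integer, TreeSet<Integer>> e : groups.entrySet()) {
            // Friend count is simply the size of the de-duplicated group.
            System.out.println(e.getKey() + " => " + e.getValue() + " -> " + e.getValue().size());
        }
    }
}
```

For the sample input this yields the same four groups as the reducer output listed above, and the size of each set is the friend count the question asks for.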

Answer 2

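1) Introduction / Problem

Before starting on the job, it is important to understand that, in a simple-minded approach, the values reaching the reducer should be sorted in ascending order. The first thought would be to pass the value list unsorted and do some kind of sorting in the reducer for each key. This has two disadvantages:

1) It is most probably not efficient for large value lists

and

2) How will the framework know that (1,4) is equal to (4,1) if these pairs are processed in different parts of the cluster?

2) Solution in theory

The way to do this in Hadoop is to "mock" the framework, in a way, by creating a synthetic key.

So our map function, instead of the "conceptually more appropriate" (if I may say so)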

map(k1, v1) -> list(k2, v2)

is the following:

map(k1, v1) -> list(ksynthetic, null)

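As you can see, we give up using the value (the reducer still gets a list of null values, but we don't care about them). What happens here is that the values are actually contained in ksynthetic. Here is an example for the problem in question:

map(1, 2) -> list([1,2], null)

However, a few more operations are needed so that the keys are grouped and partitioned properly and the reducer produces the correct result.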
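As a rough illustration of the synthetic-key idea, here is a plain-Java sketch with no Hadoop involved. The names SyntheticKeySketch, toSyntheticKeys and groupByRealKey are hypothetical. Using a TreeSet to sort is a simplification I am assuming for brevity: it also drops duplicate keys, which the real framework sort does not do.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration class, not part of any Hadoop API.
class SyntheticKeySketch {
    // Map step: the whole relationship becomes the key; no value is carried.
    static List<String> toSyntheticKeys(String[] lines) {
        List<String> keys = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            keys.add(parts[0] + "," + parts[1]);
            keys.add(parts[1] + "," + parts[0]); // reverse relationship
        }
        return keys;
    }

    // Stand-in for sort + grouping: sort the full keys, then group on the
    // "real" key, i.e. the part before the comma.
    static Map<String, List<String>> groupByRealKey(List<String> syntheticKeys) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String key : new TreeSet<>(syntheticKeys)) { // sorted; duplicates dropped (simplification)
            String realKey = key.split(",")[0];
            groups.computeIfAbsent(realKey, k -> new ArrayList<>()).add(key);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] lines = {"1,2", "2,1", "1,3", "3,2", "2,4", "4,1"};
        System.out.println(groupByRealKey(toSyntheticKeys(lines)));
    }
}
```

Each group holds every synthetic key that shares one real key, which is what a single reduce call is meant to see once grouping and partitioning are set up.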

3) Hadoop implementation

We will implement a class named FFGroupComparator and a class named FindFriendPartitioner.

Here is our FFGroupComparator:

public static class FFGroupComparator extends WritableComparator
{
    protected FFGroupComparator()
    {
        // true: let the comparator create Text instances, so compare() receives deserialized keys
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2)
    {
        Text t1 = (Text) w1;
        Text t2 = (Text) w2;
        String[] t1Items = t1.toString().split(",");
        String[] t2Items = t2.toString().split(",");
        String t1Base = t1Items[0];
        String t2Base = t2Items[0];
        int comp = t1Base.compareTo(t2Base); // We compare using "real" key part of our synthetic key

        return comp;
    }
}

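This class acts as our grouping comparator. It controls which keys are grouped together for a single call to Reducer.reduce(Object, Iterable, org.apache.hadoop.mapreduce.Reducer.Context). This is very important, because it ensures that each reducer gets the appropriate synthetic keys (judged by the real key).

Since Hadoop runs in clusters with many nodes, it is important to make sure there are as many reduce tasks as partitions. Their number should be the same as the number of real (not synthetic) keys. Usually we do this with hash values. In our case, what we need to do is compute the partition a synthetic key belongs to from the hash value of the real key (the part before the comma). So our FindFriendPartitioner is as follows: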

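public static class FindFriendPartitioner extends Partitioner<Text, NullWritable>
{
    @Override
    public int getPartition(Text key, NullWritable value, int numPartitions)
    {
        // Partition on the "real" key only (the part before the comma)
        String[] keyItems = key.toString().split(",");
        String keyBase = keyItems[0];
        // Mask the sign bit: hashCode() may be negative, a partition index must not be
        int part = (keyBase.hashCode() & Integer.MAX_VALUE) % numPartitions;
        return part;
    }
}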
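
To see the partitioning rule in isolation, here is a small plain-Java sketch. PartitionSketch and partitionFor are hypothetical names. The masking with Integer.MAX_VALUE is my own precaution against negative hash codes, under the assumption that a partition index has to be non-negative.

```java
// Hypothetical illustration class, not part of any Hadoop API.
class PartitionSketch {
    // Partition index computed from the real key only (the text before the comma).
    static int partitionFor(String syntheticKey, int numPartitions) {
        String realKey = syntheticKey.split(",")[0];
        // Clearing the sign bit keeps the result in [0, numPartitions).
        return (realKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // Both synthetic keys share the real key "1", so they get the same partition.
        System.out.println(partitionFor("1,2", 4));
        System.out.println(partitionFor("1,3", 4));
    }
}
```

Synthetic keys with the same real key always map to the same partition, so all of a user's relationships reach the same reduce task.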

So now we are all set to write the actual job and solve our problem.

I assume your input file looks like this:

1,2
2,1
1,3
3,2
2,4
4,1

We will use TextInputFormat.

Here is the code for the job driver, using Hadoop 1.0.4:

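import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

public class FindFriendTwo
{
    public static class FindFriendMapper extends Mapper<Object, Text, Text, NullWritable> {

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException
        {
            // Emit the relationship as read, e.g. "1,2"; NullWritable is a singleton obtained via get()
            context.write(value, NullWritable.get());

            String tempStrings[] = value.toString().split(",");

            Text value2 = new Text(tempStrings[1] + "," + tempStrings[0]); //reverse relationship

            context.write(value2, NullWritable.get());
        }

    }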

Note that we also emit the reverse relationship in the map function.

For example, if the input string is (1,4), we must not forget (4,1).

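public static class FindFriendReducer extends Reducer<Text, NullWritable, IntWritable, IntWritable> {

    public void reduce(Text syntheticKey, Iterable<NullWritable> values, Context context)
        throws IOException, InterruptedException {

        // One reduce() call covers every synthetic key that shares the same real key
        Set<String> friendsSet = new LinkedHashSet<String>();
        String user = null;

        // Hadoop refreshes syntheticKey while the values are iterated,
        // so each pass sees the next "user,friend" key of the group
        for (NullWritable ignored : values)
        {
            String tempKeys[] = syntheticKey.toString().split(",");
            user = tempKeys[0];
            friendsSet.add(tempKeys[1]); // the set drops duplicate relationships
        }

        // Emit (user, total number of friends), as asked in the question
        IntWritable key = new IntWritable(Integer.parseInt(user));
        IntWritable value = new IntWritable(friendsSet.size());
        context.write(key, value);
    }

}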

Finally, we must remember to include the following in the main class, so that the framework uses our classes.

jobConf.setGroupingComparatorClass(FFGroupComparator.class);
jobConf.setPartitionerClass(FindFriendPartitioner.class);

Answer 3

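I would solve this problem as follows.

1) Make sure we have all relationships, each exactly once.
2) Simply count them.

Notes on my approach:

My notation for key-value pairs is: K -> V
Both key and value are almost always a data structure (not just a string or an int).
I never use the key for data. The key is ONLY there to control the flow from the mappers to the right reducer. Everywhere else I don't look at the key at all. The framework does require a key everywhere. With '()' I mean that there is a key which I ignore completely.
The key point of my approach is that it never needs "all friends" in memory at the same moment (so it also works in really big situations).

We start with a lot of

(x,y)

and we know that the dataset does not contain all relationships.

Mapper: create all relationships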

Input:  ()    -> (x,y)
Output: (x,y) -> (x,y)
        (y,x) -> (y,x)

Reducer: remove duplicates (just output the first one from the iterator)

Input:  (x,y) -> [(x,y),(x,y),(x,y),(x,y),.... ]
Output: ()    -> (x,y)

Mapper: "Wordcount"

Input:  ()  -> (x,y)
Output: (x) -> (x,1)

Reducer: count them

Input:  (x) -> [(x,1),(x,1),(x,1),(x,1),.... ]
Output: ()  -> (x,N)
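
The two passes can be imitated in plain Java roughly like this. It is only a sketch: TwoJobSketch, allRelationsOnce and countPerUser are hypothetical names, and the in-memory set and map merely stand in for the shuffle between mapper and reducer (a real job would not hold everything in memory).

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration class, not part of any Hadoop API.
class TwoJobSketch {
    // Job 1: emit (x,y) and (y,x) keyed by the pair itself, keep one copy per key.
    static Set<String> allRelationsOnce(String[] pairs) {
        Set<String> unique = new LinkedHashSet<>();
        for (String pair : pairs) {
            String[] xy = pair.split(",");
            unique.add(xy[0] + "," + xy[1]);
            unique.add(xy[1] + "," + xy[0]);
        }
        return unique;
    }

    // Job 2: "wordcount" on x, emitting 1 per relationship and summing per key.
    static Map<String, Integer> countPerUser(Set<String> relations) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String relation : relations) {
            String x = relation.split(",")[0];
            counts.merge(x, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] input = {"1,2", "2,1", "1,3", "3,2", "2,4", "4,1"};
        System.out.println(countPerUser(allRelationsOnce(input)));
    }
}
```

The design point is that the second pass only ever adds up ones per user, so no reducer has to hold a user's full friend list.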

Answer 4

With the help of many excellent engineers, I finally tried out a solution.

There is only one Mapper and one Reducer. No combiner here.

Input of the Mapper:

1,2
2,1
1,3
3,1
3,2
3,4
5,1

Output of the Mapper:

1,2
2,1
1,2
2,1
1,3
3,1
1,3
3,1
4,3
3,4
1,5
5,1

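Output of the Reducer:

The first is the user, the second is the friend #.

In the reducer stage, I added a HashSet to assist the analysis. Thanks @Artem Tsikiridis @Ashish, your answers gave me a nice clue.

Edit:

Added code:

// mapper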

public static class TokenizerMapper extends
        Mapper<Object, Text, Text, Text> {
    private Text word1 = new Text();
    private Text word2 = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line, ",");
        // Skip lines that do not hold a full "user,friend" pair
        if (itr.countTokens() < 2) {
            return;
        }
        word1.set(itr.nextToken().toLowerCase());
        word2.set(itr.nextToken().toLowerCase());
        // Emit the relationship in both directions
        context.write(word1, word2);
        context.write(word2, word1);
    }
}

// reducer

public static class IntSumReducer extends
        Reducer<Text, Text, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<Text> values,
            Context context) throws IOException, InterruptedException {
        // Hadoop reuses the same Text instance for every value, so store String copies
        HashSet<String> set = new HashSet<String>();
        for (Text val : values) {
            set.add(val.toString());
        }

        // The number of distinct friends is the size of the set
        result.set(set.size());
        context.write(key, result);
    }
}

Find all users' friends: how to implement it with Hadoop MapReduce?
