MapReduce canopy clustering centers

Date: 2014-01-05 09:11:22

Tags: java hadoop map reduce canopy

I am trying to understand this code for canopy clustering. The purpose of these two classes (one map, one reduce) is to find the canopy centers. My problem is that I don't understand the difference between the map and reduce functions. They are nearly identical.

So is there a difference, or am I just repeating the same process in the reducer?

I think the answer is that the map and reduce functions handle the code differently. Even with similar code, they perform different operations on the data.

Can someone explain the map and reduce process when we are trying to find the canopy centers?

I know, for example, that the map might look like this: (joe, 1) (dave, 1) (joe, 1) (joe, 1)

And then the reduce would look like this: (joe, 3) (dave, 1)

Does the same thing happen here?

Or am I perhaps performing the same task twice?

Thanks very much.

Map function:

package nasdaq.hadoop;

import java.io.*;
import java.util.*;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.util.*;

public class CanopyCentersMapper extends Mapper<LongWritable, Text, Text, Text> {
    //A list with the centers of the canopy
    private ArrayList<ArrayList<String>> canopyCenters;

@Override
public void setup(Context context) {
        this.canopyCenters = new ArrayList<ArrayList<String>>();
}

@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    //Separate the stock name from the values: the key is the stock symbol and the rest is a list of values - but what is the list of values?
    //What exactly are we splitting here?
    ArrayList<String> stockData = new ArrayList<String>(Arrays.asList(value.toString().split(","))); 

    //Remove the stock name; the remaining fields are the candidate canopy center
    String stockKey = stockData.remove(0);

    //Join the remaining fields back into a comma-separated string for output
    String stockValue = StringUtils.join(",", stockData);

    //Check whether the stock is available for use as a new canopy center
    boolean isClose = false;    

    for (ArrayList<String> center : canopyCenters) {    //Run over the centers

    //I think...let's say at this point we have a few centers. Then we have our next point to check.
    //We have to compare that point with EVERY center already created. If the distance is larger than EVERY T1
    //then that point becomes a new center! But the more canopies we have there is a good chance it is within
    //the radius of one of the canopies...

            //Measure the distance between this center and the current point
            if (ClusterJob.measureDistance(center, stockData) <= ClusterJob.T1) {
                    //Center is too close
                    isClose = true;
                    break;
            }
    }

    //The point is not within T1 of any existing center, so add it as a new canopy center
    if (!isClose) {
        //Center is not too close, add the current data to the center
        canopyCenters.add(stockData);

        //Prepare hadoop data for output
        Text outputKey = new Text();
        Text outputValue = new Text();

        outputKey.set(stockKey);
        outputValue.set(stockValue);

        //Output the stock key and values to reducer
        context.write(outputKey, outputValue);
    }
}

}
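The center-selection loop the mapper runs can be exercised in plain Java. This is a minimal sketch (the T1 value, the points, and the helper names here are made up for illustration, not taken from ClusterJob):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CanopyDemo {
    static final double T1 = 3.0; // hypothetical threshold

    // Plain Euclidean distance over the shared dimensions
    static double distance(List<Double> a, List<Double> b) {
        double sum = 0.0;
        for (int i = 0; i < a.size() && i < b.size(); i++) {
            double d = a.get(i) - b.get(i);
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Mirrors the mapper's loop: a point becomes a new center only if it is
    // farther than T1 from EVERY existing center.
    static List<List<Double>> selectCenters(List<List<Double>> points) {
        List<List<Double>> centers = new ArrayList<>();
        for (List<Double> p : points) {
            boolean isClose = false;
            for (List<Double> c : centers) {
                if (distance(c, p) <= T1) { isClose = true; break; }
            }
            if (!isClose) centers.add(p);
        }
        return centers;
    }

    public static void main(String[] args) {
        List<List<Double>> points = Arrays.asList(
            Arrays.asList(0.0, 0.0),
            Arrays.asList(1.0, 1.0),   // within T1 of (0,0) -> skipped
            Arrays.asList(10.0, 10.0)  // far from (0,0) -> new center
        );
        System.out.println(selectCenters(points).size()); // prints 2
    }
}
```

So the mapper is doing exactly this loop over its own input split, keeping the accepted centers in memory across map() calls.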

Reduce function:

package nasdaq.hadoop;

import java.io.*;
import java.util.*;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

public class CanopyCentersReducer extends Reducer<Text, Text, Text, Text> {
    //The canopy centers list
    private ArrayList<ArrayList<String>> canopyCenters;

@Override
public void setup(Context context) {
        //Create a new list for the canopy centers
        this.canopyCenters = new ArrayList<ArrayList<String>>();
}

@Override
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    for (Text value : values) {
        //Format the value and key to fit the format
        String stockValue = value.toString();
        ArrayList<String> stockData = new ArrayList<String>(Arrays.asList(stockValue.split(",")));
        String stockKey = key.toString();

        //Check whether the stock is available for use as a new canopy center
        boolean isClose = false;    
        for (ArrayList<String> center : canopyCenters) {    //Run over the centers
                //Measure the distance between this center and the current point
                if (ClusterJob.measureDistance(center, stockData) <= ClusterJob.T1) {
                        //Center is too close
                        isClose = true;
                        break;
                }
        }

        //The point is not within T1 of any existing center, so add it as a new canopy center
        if (!isClose) {
            //Center is not too close, add the current data to the center
            canopyCenters.add(stockData);

            //Prepare hadoop data for output
            Text outputKey = new Text();
            Text outputValue = new Text();

            outputKey.set(stockKey);
            outputValue.set(stockValue);

            //Output the stock key and value
            context.write(outputKey, outputValue);
        }


    }
}
}
**Edit** - more code and explanation

stockKey is the key value that represents the stock (NASDAQ symbols and things like that).

ClusterJob.measureDistance():

    public static double measureDistance(ArrayList<String> origin, ArrayList<String> destination)
{
    double deltaSum = 0.0;
    //Run over all points in the origin vector and calculate the sum of the squared deltas
    for (int i = 0; i < origin.size(); i++) {
        if (destination.size() > i) //Only add to sum if there is a destination to compare to
        {
            deltaSum = deltaSum + Math.pow(Math.abs(Double.valueOf(origin.get(i)) - Double.valueOf(destination.get(i))),2);
        }
    }
    //Return the square root of the sum
    return Math.sqrt(deltaSum);
}

1 Answer:

Answer 0 (score: 2)

OK, the straightforward interpretation of the code is:

- The mapper walks over some (probably random) subset of the data and generates canopy centers, all of which are at least T1 apart from each other. These centers are emitted.
- The reducer then walks over all the canopy centers belonging to each particular stock key (like MSFT, GOOG, etc.) from all the mappers, and ensures that for each stock key there are no canopy centers within T1 of each other (e.g., no two centers for MSFT are within T1 of each other, although a MSFT center and a GOOG center may be close together).

The goal of the code is unclear, and personally I think there must be a bug. The reducer essentially solves the problem as if you were trying to generate centers for each stock key independently (i.e., computing canopy centers for all the data points of GOOG), while the mapper appears to solve the problem of generating centers for all stocks. Put together like this, they contradict each other, so neither problem actually gets solved.

If you want centers across ALL stock keys: then the map output must send everything to ONE reducer. Set the map output key to something trivial like a NullWritable. Then the reducer will perform the correct operation without any changes.

If you want centers per stock key: then the mapper needs to change so that it effectively keeps a separate canopy list for each stock key. You can do this either by keeping a separate ArrayList for each stock key (preferred, since it will be faster), or by changing the distance metric so that points belonging to different stock keys are infinitely far apart (so they never interact).
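The "separate canopy list per stock key" idea can be sketched in plain Java with a HashMap keyed by the stock symbol (class name, method names, T1, and the sample points here are all hypothetical, not from the original job):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerKeyCanopy {
    static final double T1 = 3.0; // hypothetical threshold

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // One canopy list per stock key, so centers of different stocks never interact
    static Map<String, List<double[]>> centersByKey = new HashMap<>();

    // Returns true if the point was accepted as a new center for this stock key
    static boolean tryAddCenter(String stockKey, double[] point) {
        List<double[]> centers =
            centersByKey.computeIfAbsent(stockKey, k -> new ArrayList<>());
        for (double[] c : centers) {
            if (distance(c, point) <= T1) return false; // too close to an existing center
        }
        centers.add(point);
        return true;
    }

    public static void main(String[] args) {
        System.out.println(tryAddCenter("GOOG", new double[]{0, 0})); // true: first GOOG center
        System.out.println(tryAddCenter("GOOG", new double[]{1, 1})); // false: within T1 of it
        System.out.println(tryAddCenter("MSFT", new double[]{1, 1})); // true: MSFT list is separate
    }
}
```

In the real mapper, the map would live in a field set up in setup(), and tryAddCenter would be called from map() with the parsed record before writing to the context.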

P.S. By the way, there are also some unrelated issues with your distance metric. First, you parse the data with Double.parseDouble but don't catch NumberFormatException. Since you feed it stockData, which contains non-numeric strings like 'GOOG' in the first field, it will crash as soon as you run it. Second, the distance metric ignores any field with a missing value. That is an incorrect implementation of an L2 (Pythagorean) distance metric. To see why, consider the string ",,,": it is at distance 0 from every other point, so if it is chosen as a canopy center, no other centers can be chosen. Instead of just setting the delta for a missing dimension to zero, you might consider setting it to something sensible, like the population mean for that attribute, or (to be safe) simply discarding that row from the dataset for the purposes of clustering.
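One hedged sketch of a hardened metric that addresses both issues: it catches the parse failure and signals that the row should be discarded (the "safe" option above), rather than imputing a mean. The class and method names here are illustrative, not the original ClusterJob API:

```java
import java.util.Arrays;
import java.util.List;

public class SafeDistance {
    // Returns null if any field is missing or non-numeric, so the caller can
    // discard the row before clustering instead of crashing or skipping fields.
    static double[] parseVector(List<String> fields) {
        double[] v = new double[fields.size()];
        for (int i = 0; i < fields.size(); i++) {
            try {
                v[i] = Double.parseDouble(fields.get(i).trim());
            } catch (NumberFormatException e) {
                return null; // empty string, or text like "GOOG"
            }
        }
        return v;
    }

    // A proper L2 metric: refuses to compare vectors of different dimension
    // rather than silently ignoring the missing coordinates.
    static double measureDistance(double[] origin, double[] destination) {
        if (origin.length != destination.length)
            throw new IllegalArgumentException("vectors must have the same dimension");
        double deltaSum = 0.0;
        for (int i = 0; i < origin.length; i++) {
            double d = origin[i] - destination[i];
            deltaSum += d * d;
        }
        return Math.sqrt(deltaSum);
    }

    public static void main(String[] args) {
        System.out.println(parseVector(Arrays.asList("GOOG", "1.5"))); // null: discard this row
        double[] a = parseVector(Arrays.asList("0", "0"));
        double[] b = parseVector(Arrays.asList("3", "4"));
        System.out.println(measureDistance(a, b)); // prints 5.0
    }
}
```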