How to print data from Canopy clusters in Apache Mahout

Date: 2014-05-23 12:50:29

Tags: java mahout

Hi Apache Mahout Experts,

I have written a simple piece of code that reads a file with input data and creates a few clusters.

I am using version 0.9.

I would like to print the data contained in the clusters.

I implemented a class CanopyClustering with three methods: convertToVectorFile(), createClusters(), and getClustersInfo().

The first method converts the file with the points into the proper format, the second one creates the clusters, and the last one prints the data to standard output.

When I run my code I see the following output:

DEBUG Groups -  Creating new Groups object
DEBUG Groups - Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
DEBUG UserGroupInformation - hadoop login
DEBUG UserGroupInformation - hadoop login commit
DEBUG UserGroupInformation - using local user:NTUserPrincipal : myname
DEBUG UserGroupInformation - UGI loginUser:myname
DEBUG FileSystem - Creating filesystem for file:///
DEBUG NativeCodeLoader - Trying to load the custom-built native-hadoop library...
DEBUG NativeCodeLoader - Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
DEBUG NativeCodeLoader - java.library.path=C:\Program Files\Java\jre7\bin;C:\Windows\Sun\Java\bin;C:\Windows\system32;C:\Windows;C:\Program Files (x86)\Intel\iCLS Client\;C:\Program Files\Intel\iCLS Client\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x86;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x64;C:\Program Files\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files\MATLAB\R2009b\runtime\win64;C:\Program Files\MATLAB\R2009b\bin;C:\Program Files\TortoiseSVN\bin;C:\Users\myname\Documents\apache-maven-3.1.1\bin;.
WARN  NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
INFO  CanopyDriver - Build Clusters Input: C:/Users/myname/Documents/jboss-as-7.1.1.Final/jboss-as-7.1.1.Final/bin/BI/synthetic_control.seq Out: C:/Users/myname/Documents/jboss-as-7.1.1.Final/jboss-as-7.1.1.Final/bin/BI/output Measure: org.apache.mahout.common.distance.EuclideanDistanceMeasure@5613e573 t1: 3.0 t2: 3.0
DEBUG CanopyClusterer - Created new Canopy:0 at center:[1.000, 2.000]
DEBUG CanopyClusterer - Added point: [2.000, 1.000] to canopy: C-0
DEBUG CanopyClusterer - Added point: [3.000, 2.000] to canopy: C-0
DEBUG CanopyClusterer - Added point: [2.000, 3.000] to canopy: C-0
DEBUG CanopyClusterer - Created new Canopy:1 at center:[4.000, 18.000]
DEBUG CanopyClusterer - Added point: [5.000, 17.000] to canopy: C-1
DEBUG CanopyClusterer - Added point: [6.000, 18.000] to canopy: C-1
DEBUG CanopyClusterer - Added point: [5.000, 19.000] to canopy: C-1
DEBUG CanopyDriver - Writing Canopy:C-0 center:[2.000, 2.000] numPoints:4 radius:[0.707, 0.707]
DEBUG CanopyDriver - Writing Canopy:C-1 center:[5.000, 18.000] numPoints:4 radius:[0.707, 0.707]
DEBUG FileSystem - Starting clear of FileSystem cache with 1 elements.
DEBUG FileSystem - Removing filesystem for file:///
DEBUG FileSystem - Removing filesystem for file:///
DEBUG FileSystem - Done clearing cache

and a few files are created:

C:.
│   .synthetic_control.seq.crc
│   synthetic_control.data
│   synthetic_control.seq
│
└───output
    ├───clusteredPoints
    │       .part-m-0.crc
    │       part-m-0
    │
    └───clusters-0-final
            .part-r-00000.crc
            ._policy.crc
            part-r-00000
            _policy

Everything looks fine, but the output of the last method is empty. I have tried several different approaches, but all I have managed to print is the name of each cluster with its center and radius, which is not what I need.

Thanks in advance.


public class Main {
    public static void main(String[] args) {

        BIManager bi = new BIManager(new CanopyClustering());

        bi.convertToVectorFile();
        bi.createClusters();
        bi.getClustersInfo();
    }
}

import java.util.List;

public class BIManager {
    private IClustering clustering;

    public BIManager(IClustering clustering) {
        this.clustering = clustering;
    }

    public void convertToVectorFile() {
        this.clustering.convertToVectorFile();
    }

    public void createClusters() {
        this.clustering.createClusters();

    }

    public List<String> getClustersInfo() {
        return this.clustering.getClustersInfo();
    }
}

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.canopy.CanopyDriver;
import org.apache.mahout.clustering.classify.WeightedVectorWritable;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.common.iterator.sequencefile.PathFilters;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

import com.my.package.bi.IClustering;
public class CanopyClustering implements IClustering {

    private final static String root = "C:\\Users\\myname\\Documents\\jboss-as-7.1.1.Final\\jboss-as-7.1.1.Final\\bin\\BI\\";
    private final static String dataDir = root + "synthetic_control.data";
    private final static String seqDir = root + "synthetic_control.seq";
    private final static String outputDir = root + "output";
    private final static String partMDir = outputDir + "\\" + "clusters-0-final" + "\\part-r-00000";
    private final static String SEPARATOR = " ";
    private final static int NUMBER_OF_ELEMENTS = 2;

    private Configuration conf;
    private FileSystem fs;

    public CanopyClustering() {
        conf = new Configuration();
        try {
            fs = FileSystem.get(conf);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    @Override
    public void convertToVectorFile() {

        try {
            BufferedReader reader = new BufferedReader(new FileReader(dataDir));
            SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
                    new Path(seqDir), LongWritable.class, VectorWritable.class);

            String line;
            long counter = 0;
            while ((line = reader.readLine()) != null) {
                String[] c;
                c = line.split(SEPARATOR);
                double[] d = new double[c.length];
                for (int i = 0; i < NUMBER_OF_ELEMENTS; i++) {
                    try {
                        d[i] = Double.parseDouble(c[i]);

                    } catch (Exception ex) {
                        d[i] = 0;
                    }
                }

                Vector vec = new RandomAccessSparseVector(c.length);
                vec.assign(d);

                VectorWritable writable = new VectorWritable();
                writable.set(vec);
                writer.append(new LongWritable(counter++), writable);
            }
            writer.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }

    }

    @Override
    public void createClusters() {

        double t1 = 3;
        double t2 = 3;
        double clusterClassificationThreshold = 3;
        boolean runSequential = true;

        EuclideanDistanceMeasure measure = new EuclideanDistanceMeasure();
        Path inputPath = new Path(seqDir);
        Path outputPath = new Path(outputDir);

        try {
            CanopyDriver.run(inputPath, outputPath, measure, t1, t2,
                    runSequential, clusterClassificationThreshold,
                    runSequential);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    @Override
    public List<String> getClustersInfo() {

        List<String> results = new ArrayList<String>();

        String s = outputDir + "\\clusteredPoints\\part-m-0";

        Path path = new Path(s);
        for (Pair<IntWritable, WeightedVectorWritable> record : new SequenceFileDirIterable<IntWritable, WeightedVectorWritable>(
                path, PathType.GLOB, PathFilters.logsCRCFilter(), conf)) {
            NamedVector vec = ((NamedVector) record.getSecond().getVector());
            System.out.println(record.getFirst().get() + "  " + vec.getName());
        }

        return results;

    }
}

import java.util.List;

public interface IClustering {

    public void convertToVectorFile();

    public void createClusters();

    public List<String> getClustersInfo();
}

1.0 2.0
2.0 1.0
3.0 2.0
2.0 3.0
4.0 18.0
5.0 17.0
6.0 18.0
5.0 19.0

2 Answers:

Answer 0 (score: 1)

You should look at the ClusterDumper class in org.apache.mahout.utils.clustering. It takes the location of the cluster files and prints their contents in a suitable format: text, JSON, or CSV.

I just use it from the command line through the clusterdump binary, e.g.:

mahout clusterdump 
-s ~/Downloads/reuters21578/parsedtext-kmeans/clusters-*-final
-d ~/Downloads/reuters21578/parsedtext-seqdir-sparse-kmeans/dictionary.file-0
-dt sequencefile -b 100 -n 20 --evaluate 
-dm org.apache.mahout.common.distance.CosineDistanceMeasure
--pointsDir ~/Downloads/reuters21578/parsedtext-kmeans/clusteredPoints
-o ~/cluster-output.txt

You can use it directly, or copy the code, modify it, and then use it.
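
If you prefer to stay in Java rather than shell out to the CLI, a minimal sketch of driving ClusterDumper programmatically could look like the following. It assumes the (Path, Path) constructor and the printClusters(String[]) method from the mahout-integration module and reuses the asker's output layout; treat it as a starting point, not a verified recipe.

import org.apache.hadoop.fs.Path;
import org.apache.mahout.utils.clustering.ClusterDumper;

public class DumpCanopyClusters {
    public static void main(String[] args) throws Exception {
        String out = "C:/Users/myname/Documents/jboss-as-7.1.1.Final/jboss-as-7.1.1.Final/bin/BI/output";

        // First path: the final cluster sequence files; second path: the points
        // that were assigned to those clusters.
        ClusterDumper dumper = new ClusterDumper(
                new Path(out, "clusters-0-final"),
                new Path(out, "clusteredPoints"));

        // Passing null for the dictionary prints the raw vectors instead of term names.
        dumper.printClusters(null);
    }
}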

UPDATE

You have to supply the points directory with the --pointsDir option. The cluster output in clusters-*-final only stores the cluster id, the number of points, and the radius; the points themselves are stored in a separate file. Note the clusters-*-final and clusteredPoints folders.

When clusterdump is used this way, it prints the number of points, the radius, and the points. Also try the output format option -of with csv. See the options here.

If you want to control the format of the output, you will have to use the iterator classes, e.g. org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable. These will help you read the clusteredPoints file in whatever format you need. See, for example, the plotClusteredSampleData function, which is given the points to read.
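
For instance, here is a minimal sketch that simply dumps whatever key/value pairs the clusteredPoints part file contains, so you can see the actual Writable classes before deciding on a format (paths taken from the question; untested):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Writable;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;

public class DumpClusteredPoints {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        Path points = new Path("C:/Users/myname/Documents/jboss-as-7.1.1.Final/jboss-as-7.1.1.Final/bin/BI/output/clusteredPoints/part-m-0");

        // Each record pairs a cluster id key with one vector assigned to that cluster;
        // printing the raw Writables shows their concrete classes and contents.
        for (Pair<Writable, Writable> record :
                new SequenceFileIterable<Writable, Writable>(points, true, conf)) {
            System.out.println(record.getFirst() + " -> " + record.getSecond());
        }
    }
}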

UPDATE 2

Going through your code, I see that you are force-casting the vector to NamedVector, so try this instead:

Vector vec = record.getSecond().getVector();
if (vec instanceof NamedVector) {
    System.out.println(record.getFirst().get() + "  " + ((NamedVector) vec).getName());
} else {
    System.out.println(record.getFirst().get() + "  " + vec.asFormatString());
}

Answer 1 (score: 0)

If you are running against the current trunk or Mahout 0.9, change WeightedVectorWritable to WeightedPropertyVectorWritable in your getClustersInfo() method.

You may not need the cast to NamedVector.
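
A sketch of how getClustersInfo() might look with that change: it reuses the fields and iterator setup already in the question's CanopyClustering class, needs an additional import of org.apache.mahout.clustering.classify.WeightedPropertyVectorWritable, and is untested against 0.9.

    @Override
    public List<String> getClustersInfo() {
        List<String> results = new ArrayList<String>();

        Path path = new Path(outputDir + "\\clusteredPoints\\part-m-0");

        // Requires: import org.apache.mahout.clustering.classify.WeightedPropertyVectorWritable;
        for (Pair<IntWritable, WeightedPropertyVectorWritable> record :
                new SequenceFileDirIterable<IntWritable, WeightedPropertyVectorWritable>(
                        path, PathType.GLOB, PathFilters.logsCRCFilter(), conf)) {
            Vector vec = record.getSecond().getVector();
            // Fall back to the raw vector when the points were not written as NamedVectors.
            String name = (vec instanceof NamedVector)
                    ? ((NamedVector) vec).getName()
                    : vec.asFormatString();
            results.add(record.getFirst().get() + "  " + name);
        }
        return results;
    }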