Can I cluster Accumulo data without using temporary files?

Time: 2014-09-15 21:57:53

Tags: opencv mahout apache-spark k-means accumulo

I want to perform k-means clustering on some of our data in Accumulo. My first thought was to use the k-means clustering in Apache Mahout, but I'm having trouble connecting the two without temporary files. As near as I can tell, to use Mahout I would need to write the Accumulo data out as a series of vector files stored in HDFS, cluster those with Mahout, and then write the results back into Accumulo (the Mahout entry points all seem to take paths to directories). I haven't tried it yet, but that just looks like a performance nightmare. Is there a better way? Alternatively, is there some other k-means clustering library that connects to Accumulo more easily? I'm currently looking at OpenCV, but other suggestions are welcome.
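For context, the temporary-file workflow described above would look roughly like the sketch below: scan Accumulo, dump every row as a VectorWritable into an HDFS SequenceFile, then point Mahout at that directory. All names here are placeholders, only the HDFS dump step is spelled out, and the comma-separated-value parsing is an assumption about the data, not something from this question.

import java.util.Map;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class DumpAccumuloToVectorFiles
{
  public static void main(String[] args) throws Exception
  {
    // Step 1: scan the source table directly.
    ZooKeeperInstance instance =
      new ZooKeeperInstance("accumulo_instance_name",
                            "zookeeper_server_1,zookeeper_server_2");
    Connector conn = instance.getConnector("accumulo_user",
                                           "users_password".getBytes());
    Scanner scanner = conn.createScanner("accumulo_table_name",
                                         new Authorizations("whatever_you_need"));

    // Dump every row as a VectorWritable into an HDFS SequenceFile.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer =
      SequenceFile.createWriter(fs, conf, new Path("/tmp/kmeans-input/part-00000"),
                                Text.class, VectorWritable.class);
    for (Map.Entry<Key, Value> entry : scanner)
    {
      // Assumption: each Value is a comma-separated list of doubles.
      String[] fields = entry.getValue().toString().split(",");
      double[] features = new double[fields.length];
      for (int i = 0; i < fields.length; i++)
        features[i] = Double.parseDouble(fields[i]);
      writer.append(new Text(entry.getKey().getRow()),
                    new VectorWritable(new DenseVector(features)));
    }
    writer.close();

    // Step 2 would run Mahout's kmeans against /tmp/kmeans-input, and step 3
    // would scan Mahout's output directory and write the results back to Accumulo.
  }
}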

1 answer:

Answer 0 (score: 0)

As @FuriousGeorge suggested, I looked into Apache Spark. It does indeed provide a way to perform k-means clustering without temporary files, as follows:

import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
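import org.apache.spark.mllib.linalg.Vectors; // only needed for the illustrative map body below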
import scala.Tuple2;

public class ClusterAccumuloData
{
  public static void main(String[] args)
  {
    // JavaSparkContext(master, appName, sparkHome, jarFile)
    JavaSparkContext sc = new JavaSparkContext("yarn-cluster",
                                               "JobName",
                                               "/spark/installation/directory",
                                               "/path/to/jar/file/containing/this/class");
    Configuration conf = new Configuration(); // As near as I can tell, this is all we need.
    Authorizations auths = new Authorizations("whatever_you_need");
    AccumuloInputFormat.setInputInfo(conf,
                                     "accumulo_user",
                                     "users_password".getBytes(),
                                     "accumulo_table_name",
                                     auths);
    AccumuloInputFormat.setZooKeeperInstance(conf, 
                                             "accumulo_instance_name",
                                             "zookeeper_server_1,zookeeper_server_2");
    // Calls to other AccumuloInputFormat functions (such as setRanges or addIterator)
    // that configure it to retrieve the data you wish to cluster.
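    // e.g., a hypothetical restriction to a single row range:
    // AccumuloInputFormat.setRanges(conf,
    //     Collections.singleton(new Range("first_row", "last_row")));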
    JavaPairRDD<Key, Value> accumuloRDD = sc.newAPIHadoopRDD(conf,
                                                             AccumuloInputFormat.class,
                                                             Key.class,
                                                             Value.class);
    JavaRDD<Vector> kmeansDataRDD =
      accumuloRDD.map(new Function<Tuple2<Key, Value>, Vector>()
                      {
                        public Vector call(Tuple2<Key, Value> accumuloData)
                        {
                          // Code which transforms accumuloData into either a
                          // DenseVector or a SparseVector, then returns that Vector.
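                          // One minimal possibility, assuming each Value holds a
                          // comma-separated list of doubles (adapt to your schema):
                          String[] fields = accumuloData._2().toString().split(",");
                          double[] features = new double[fields.length];
                          for (int i = 0; i < fields.length; i++)
                            features[i] = Double.parseDouble(fields[i]);
                          return Vectors.dense(features);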
                        }
                      });
    // KMeans.train(data, k, maxIterations, runs): here 42 clusters, at most
    // 14 iterations, and 37 parallel runs; tune these for your data.
    KMeansModel kmm = KMeans.train(JavaRDD.toRDD(kmeansDataRDD), 42, 14, 37);
  }
}
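To complete the round trip, the trained model can label each vector and the assignments can be pushed back into Accumulo with a BatchWriter. The following is only a rough sketch under stated assumptions: the results table name, the column names, and the synthesized row ids are illustrative, not part of the original answer, and it needs these additional imports on top of those already in the class.

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.data.Mutation;
import org.apache.hadoop.io.Text;

  // Hypothetical helper, callable at the end of main() (which would then need
  // a throws clause or a try/catch around the call).
  static void writeClusterAssignments(JavaRDD<Vector> data, KMeansModel model)
    throws Exception
  {
    ZooKeeperInstance instance =
      new ZooKeeperInstance("accumulo_instance_name",
                            "zookeeper_server_1,zookeeper_server_2");
    Connector conn = instance.getConnector("accumulo_user",
                                           "users_password".getBytes());
    // maxMemory (bytes), maxLatency (ms), number of write threads
    BatchWriter writer = conn.createBatchWriter("results_table_name",
                                                1000000L, 60000L, 2);
    long rowId = 0;
    for (Vector v : data.collect()) // collect() only suits modest result sizes
    {
      // In a real job you would carry each row's original Key through the RDD
      // rather than synthesizing row ids like this.
      Mutation m = new Mutation(new Text(Long.toString(rowId++)));
      m.put(new Text("kmeans"), new Text("cluster"),
            new Value(Integer.toString(model.predict(v)).getBytes()));
      writer.addMutation(m);
    }
    writer.close();
  }

Writing through a BatchWriter from the driver keeps the sketch simple; for large result sets you would instead write from the executors, for example via AccumuloOutputFormat.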