Scala Elasticsearch indexing with a dynamically changing schema

Date: 2016-08-04 06:51:42

Tags: scala elasticsearch indexing apache-spark mapping

I found the following code on a CERN website. FYI: I am using Spark 1.3. The sample code works great when you already know the schema of the dataset you want to index into Elasticsearch.

However, can someone point me in the right direction so that I can create a method that:

takes as parameters a schema structure (column names/data types) coming from an external source (the hard part) and the name of the file to be indexed (the easy part), and performs the schema mapping dynamically inside the function?

With a method like that, I could generate the mapping and index the dataset into ES.
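
In other words, something shaped roughly like the hypothetical signature below (the name indexWithSchema and its parameters are placeholders for illustration, not an existing API):

    // Hypothetical signature only -- every name here is a placeholder
    // schemaSource: column name / data type pairs obtained from an external source
    // filePath:     the file whose contents should be indexed
    // esResource:   the target Elasticsearch "index/type"
    def indexWithSchema(schemaSource: Seq[(String, String)],
                        filePath: String,
                        esResource: String): Unit = ???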

Sample code:

    //import elasticsearch packages
    import org.elasticsearch.spark._

    //define the schema
    case class MemT(dt: String, server: String, memoryused: Integer)

    //load the csv file into rdd
    val Memcsv = sc.textFile("/tmp/flume_memusage.csv") 

    //split the fields, trim and map it to the schema
    val MemTrdd = Memcsv.map(line=>line.split(",")).map(line=>MemT(line(0).trim.toString,line(1).trim.toString,line(2).trim.toInt))

    //write the rdd to the elasticsearch
    MemTrdd.saveToEs("fmem/logs")

Thanks!

Source: https://db-blog.web.cern.ch/blog/prasanth-kothuri/2016-05-integrating-hadoop-and-elasticsearch-%E2%80%93-part-2-%E2%80%93-writing-and-querying

1 Answer:

Answer 0: (score: 0)

What I wanted to achieve was the ability to index into ES directly from a DataFrame, with the index mapping driven from an external schema source. Here is how I achieved this...

BTW: I have omitted extra validation/processing, but this skeleton code should get anyone with a similar requirement going...

I included the following ES dependency in my build.sbt file:

"org.elasticsearch" % "elasticsearch-spark_2.10" % "2.3.3"

Comments welcome...

//ES Spark connector imports (EsSparkSQL is needed for the DataFrame save below)
import org.elasticsearch.spark._
import org.elasticsearch.spark.sql.EsSparkSQL

//Spark SQL imports needed for the dynamic schema and Row construction
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._


//Declare a schema as comma-separated "name:type" pairs
val schemaString = "age:int,name:string,location:string"

//Load the raw csv lines into an RDD
val rdd = sc.textFile("/path/to/your/file.csv")

val separator = "," //This is the separator in the csv

//Map a type name from the schema string to a Spark SQL DataType
//(this helper was referenced but not defined in the original answer; a minimal version is assumed here)
def getFieldTypeInSchema(fieldType: String): DataType = fieldType.trim.toLowerCase match {
  case "int"       => IntegerType
  case "double"    => DoubleType
  case "long"      => LongType
  case "float"     => FloatType
  case "byte"      => ByteType
  case "timestamp" => TimestampType
  case _           => StringType
}

//Convert the schema string above into a StructType
val schema = StructType(
  schemaString.split(",").map { field =>
    val Array(name, fieldType) = field.split(":")
    StructField(name.trim, getFieldTypeInSchema(fieldType), nullable = true)
  })

//Map each csv line to a Row, casting each token according to the schema
val rowRDDx = rdd.map { line =>
  val tokens = line.split(separator)
  val values = tokens.zipWithIndex.map { case (value, index) =>
    schema.fields(index).dataType match {
      case IntegerType   => value.toInt
      case DoubleType    => value.toDouble
      case LongType      => value.toLong
      case FloatType     => value.toFloat
      case ByteType      => value.toByte
      case TimestampType => value // timestamps are passed through as strings here
      case _             => value
    }
  }
  Row.fromSeq(values)
}

//Convert the RDD of Rows to a DataFrame, specifying the intended schema
//(sqlContext is predefined in the spark-shell; otherwise create one from sc)
val df = sqlContext.createDataFrame(rowRDDx, schema)

//Index the DataFrame into ES (resource is "index/type")
EsSparkSQL.saveToEs(df, "test/doc")
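
To tie this back to the question, the skeleton above can be folded into one reusable method that takes the schema string, the file path and the target index as parameters. Below is a minimal sketch under the same assumptions (it reuses the getFieldTypeInSchema helper and expects sc and sqlContext to be in scope, as in the spark-shell); the method and parameter names are illustrative:

// Minimal sketch of the reusable method the question asked for.
def indexCsvWithSchema(schemaString: String,
                       filePath: String,
                       esResource: String,
                       separator: String = ","): Unit = {
  // Build the StructType from the comma-separated "name:type" schema string
  val schema = StructType(schemaString.split(",").map { field =>
    val Array(name, fieldType) = field.split(":")
    StructField(name.trim, getFieldTypeInSchema(fieldType), nullable = true)
  })

  // Parse each csv line into a Row, casting tokens according to the schema
  val rowRdd = sc.textFile(filePath).map { line =>
    val values = line.split(separator).zipWithIndex.map { case (value, i) =>
      schema.fields(i).dataType match {
        case IntegerType => value.trim.toInt
        case DoubleType  => value.trim.toDouble
        case LongType    => value.trim.toLong
        case FloatType   => value.trim.toFloat
        case ByteType    => value.trim.toByte
        case _           => value.trim
      }
    }
    Row.fromSeq(values)
  }

  // Create the DataFrame with the intended schema and index it into ES
  EsSparkSQL.saveToEs(sqlContext.createDataFrame(rowRdd, schema), esResource)
}

// Example usage (file path and index name are illustrative):
indexCsvWithSchema("age:int,name:string,location:string", "/path/to/your/file.csv", "test/doc")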