Scala Elasticsearch indexing with a dynamically changing schema

Date: 2016-08-04 06:51:42

Tags: scala elasticsearch indexing apache-spark mapping

I found the following code on a CERN website. FYI: I am using Spark 1.3. The sample code works great when you already know the schema of the dataset you want to index into Elasticsearch.

However, can someone point me in the right direction so that I can create a method that:

takes as parameters a schema structure (column names/data types) coming from an external source (the hard part) and the name of the file to be indexed (the easy part), and performs the schema mapping dynamically inside the function?

With a method like that, I could generate the mapping and index the dataset into ES.
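
In other words, something shaped roughly like the hypothetical signature below (the name indexWithSchema and its parameters are placeholders for illustration, not an existing API):

    // Hypothetical signature only -- every name here is a placeholder
    // schemaSource: column name / data type pairs obtained from an external source
    // filePath:     the file whose contents should be indexed
    // esResource:   the target Elasticsearch "index/type"
    def indexWithSchema(schemaSource: Seq[(String, String)],
                        filePath: String,
                        esResource: String): Unit = ???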

Sample code:

    //import elasticsearch packages
    import org.elasticsearch.spark._

    //define the schema
    case class MemT(dt: String, server: String, memoryused: Integer)

    //load the csv file into rdd
    val Memcsv = sc.textFile("/tmp/flume_memusage.csv") 

    //split the fields, trim and map it to the schema
    val MemTrdd = Memcsv.map(line=>line.split(",")).map(line=>MemT(line(0).trim.toString,line(1).trim.toString,line(2).trim.toInt))

    //write the rdd to the elasticsearch
    MemTrdd.saveToEs("fmem/logs")

Thanks!

Source: https://db-blog.web.cern.ch/blog/prasanth-kothuri/2016-05-integrating-hadoop-and-elasticsearch-%E2%80%93-part-2-%E2%80%93-writing-and-querying

1 Answer:

Answer 0: (score: 0)

What I wanted to achieve was the ability to index into ES directly from a DataFrame, with the index mapping driven from an external schema source. Here is how I achieved this...

BTW: I have omitted extra validation/processing, but this skeleton code should get anyone with a similar requirement going...

I included the following ES dependency in my build.sbt file:

"org.elasticsearch" % "elasticsearch-spark_2.10" % "2.3.3"

Comments welcome...

//ES Spark connector imports (EsSparkSQL is needed for the DataFrame save below)
import org.elasticsearch.spark._
import org.elasticsearch.spark.sql.EsSparkSQL

//Spark SQL imports needed for the dynamic schema and Row construction
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._


//Declare a schema as comma-separated "name:type" pairs
val schemaString = "age:int,name:string,location:string"

//Load the raw csv lines into an RDD
val rdd = sc.textFile("/path/to/your/file.csv")

val separator = "," //This is the separator in the csv

//Map a type name from the schema string to a Spark SQL DataType
//(this helper was referenced but not defined in the original answer; a minimal version is assumed here)
def getFieldTypeInSchema(fieldType: String): DataType = fieldType.trim.toLowerCase match {
  case "int"       => IntegerType
  case "double"    => DoubleType
  case "long"      => LongType
  case "float"     => FloatType
  case "byte"      => ByteType
  case "timestamp" => TimestampType
  case _           => StringType
}

//Convert the schema string above into a StructType
val schema = StructType(
  schemaString.split(",").map { field =>
    val Array(name, fieldType) = field.split(":")
    StructField(name.trim, getFieldTypeInSchema(fieldType), nullable = true)
  })

//Map each csv line to a Row, casting each token according to the schema
val rowRDDx = rdd.map { line =>
  val tokens = line.split(separator)
  val values = tokens.zipWithIndex.map { case (value, index) =>
    schema.fields(index).dataType match {
      case IntegerType   => value.toInt
      case DoubleType    => value.toDouble
      case LongType      => value.toLong
      case FloatType     => value.toFloat
      case ByteType      => value.toByte
      case TimestampType => value // timestamps are passed through as strings here
      case _             => value
    }
  }
  Row.fromSeq(values)
}

//Convert the RDD of Rows to a DataFrame, specifying the intended schema
//(sqlContext is predefined in the spark-shell; otherwise create one from sc)
val df = sqlContext.createDataFrame(rowRDDx, schema)

//Index the DataFrame into ES (resource is "index/type")
EsSparkSQL.saveToEs(df, "test/doc")
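
To tie this back to the question, the skeleton above can be folded into one reusable method that takes the schema string, the file path and the target index as parameters. Below is a minimal sketch under the same assumptions (it reuses the getFieldTypeInSchema helper and expects sc and sqlContext to be in scope, as in the spark-shell); the method and parameter names are illustrative:

// Minimal sketch of the reusable method the question asked for.
def indexCsvWithSchema(schemaString: String,
                       filePath: String,
                       esResource: String,
                       separator: String = ","): Unit = {
  // Build the StructType from the comma-separated "name:type" schema string
  val schema = StructType(schemaString.split(",").map { field =>
    val Array(name, fieldType) = field.split(":")
    StructField(name.trim, getFieldTypeInSchema(fieldType), nullable = true)
  })

  // Parse each csv line into a Row, casting tokens according to the schema
  val rowRdd = sc.textFile(filePath).map { line =>
    val values = line.split(separator).zipWithIndex.map { case (value, i) =>
      schema.fields(i).dataType match {
        case IntegerType => value.trim.toInt
        case DoubleType  => value.trim.toDouble
        case LongType    => value.trim.toLong
        case FloatType   => value.trim.toFloat
        case ByteType    => value.trim.toByte
        case _           => value.trim
      }
    }
    Row.fromSeq(values)
  }

  // Create the DataFrame with the intended schema and index it into ES
  EsSparkSQL.saveToEs(sqlContext.createDataFrame(rowRdd, schema), esResource)
}

// Example usage (file path and index name are illustrative):
indexCsvWithSchema("age:int,name:string,location:string", "/path/to/your/file.csv", "test/doc")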