I found the following code on CERN's website. FYI: I'm using Spark 1.3. The sample code is great when you already know the schema of the dataset you want to index into Elasticsearch.
However, can someone point me in the right direction so that I can create a method like this:
one that takes as parameters a schema structure from an external source (column names/data types) (the hard part) and the name of the file to index (the easy part), and performs the schema mapping dynamically inside the function?
With such a method I could generate the mapping and index datasets in ES.
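In other words, a method whose shape is roughly like the following (the names and types here are purely illustrative, not anything I already have):
//hypothetical signature, only to show the shape I'm after
def indexCsvToEs(
    schemaSpec: Seq[(String, String)], //(column name, data type) read from an external source
    csvPath: String,                   //the file to index
    esResource: String                 //target "index/type" in Elasticsearch
): Unit = ???                          //dynamic schema mapping would happen in here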
Sample code:
//import elasticsearch packages
import org.elasticsearch.spark._
//define the schema
case class MemT(dt: String, server: String, memoryused: Integer)
//load the csv file into an RDD
val Memcsv = sc.textFile("/tmp/flume_memusage.csv")
//split the fields, trim them, and map them onto the schema
val MemTrdd = Memcsv.map(line => line.split(",")).map(line => MemT(line(0).trim, line(1).trim, line(2).trim.toInt))
//write the RDD to Elasticsearch
MemTrdd.saveToEs("fmem/logs")
Thanks!
Answer 0 (score: 0)
What I wanted to achieve is the ability to index directly from a DataFrame into ES, with the index mapping driven from an external schema source. Here is how I achieved that....
BTW: I've omitted the extra validation/handling, but this skeleton code should get anyone with a similar requirement going....
I included the following ES dependency in my build.sbt file:
"org.elasticsearch" % "elasticsearch-spark_2.10" % "2.3.3"
Comments welcome....
//Just showing the ES and Spark SQL related imports
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.elasticsearch.spark._
import org.elasticsearch.spark.sql.EsSparkSQL
//Declare a schema as comma-separated "name:type" pairs
val schemaString = "age:int,name:string,location:string"
//Load the raw csv data into an RDD
val rdd = sc.textFile("/path/to/your/file.csv")
val separator = "," //the field separator used in the csv
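//The struct conversion below relies on getFieldTypeInSchema, which the
//original answer never defines. A minimal sketch of what it could look
//like (my assumption, not part of the original code):
def getFieldTypeInSchema(fieldType: String): DataType = fieldType.trim.toLowerCase match {
  case "int" | "integer" => IntegerType
  case "double"          => DoubleType
  case "long"            => LongType
  case "float"           => FloatType
  case "byte"            => ByteType
  case "timestamp"       => TimestampType
  case _                 => StringType //fall back to string for anything unrecognised
}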
//Convert the schema string above into a StructType
val schema =
  StructType(
    schemaString.split(",").map(fieldName =>
      StructField(fieldName.split(":")(0).trim,
        getFieldTypeInSchema(fieldName.split(":")(1)), nullable = true)))
//Map each csv line to a Row, converting every token to the type declared in the schema
val rowRDDx = rdd.map { p =>
  var list: collection.mutable.Seq[Any] = collection.mutable.Seq.empty[Any]
  var index = 0
  val tokens = p.split(separator)
  tokens.foreach { value =>
    val valType = schema.fields(index).dataType
    var returnVal: Any = null
    valType match {
      case IntegerType   => returnVal = value.toInt
      case DoubleType    => returnVal = value.toDouble
      case LongType      => returnVal = value.toLong
      case FloatType     => returnVal = value.toFloat
      case ByteType      => returnVal = value.toByte
      case StringType    => returnVal = value
      case TimestampType => returnVal = value
      case _             => returnVal = value //validation omitted: keep the raw string
    }
    list = list :+ returnVal
    index += 1
  }
  Row.fromSeq(list)
}
//Convert the Row RDD to a DataFrame, specifying the intended schema
val df = sqlContext.createDataFrame(rowRDDx, schema)
//Index the DataFrame into ES
EsSparkSQL.saveToEs(df, "test/doc")
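One note: for saveToEs to reach your cluster, the elasticsearch-hadoop connection settings must be present in the Spark configuration. A minimal sketch, assuming you build your own context rather than using the shell's built-in sc ("localhost"/"9200" are placeholders for your own cluster):
import org.apache.spark.{SparkConf, SparkContext}
//es.nodes and es.port are standard elasticsearch-hadoop settings
val conf = new SparkConf()
  .setAppName("csv-to-es")
  .set("es.nodes", "localhost")
  .set("es.port", "9200")
val sc = new SparkContext(conf)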