How to use instr() function with Column type arguments in Spark

Asked: 2018-07-13 08:25:25

Tags: scala apache-spark apache-spark-sql

I have a problem using the instr() function in Spark. The function's signature looks like this:

instr(Column str, String substring)

The problem is that I need to pass a Column-typed value as the second argument. I wrote an example function that takes two Column arguments:

def test_func(val1: Column, val2: Column): Column = {
  val instr_val : Column = instr(val2, val1)
  instr_val
}

val df = sc.parallelize(Seq((123, "940932123"), (940, "123940932"), (932, "940123932"))).toDF("KOL1", "KOL2")
df.withColumn("KOL3", test_func($"KOL1", $"KOL2")).show

This fails with the following error:

<console>:322: error: type mismatch;
 found   : org.apache.spark.sql.Column
 required: String
           val instr_val : Column = instr(val2, val1)

I tried the expr() function as well, but it also gives an error. Does anyone know how to fix this?

1 answer:

Answer 0 (score: 0)

instr is not meant to be used the way you want, but you can always define a udf to do the job:

scala> val instr2_ : (String, String) => Int = (str, sub) => str.indexOfSlice(sub)
// instr2_: (String, String) => Int = <function2>

scala> val instr2 = udf(instr2_)
// instr2: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,IntegerType,Some(List(StringType, StringType)))

scala> df.withColumn("KOL3", instr2($"KOL2",$"KOL1")).show
// +----+---------+----+
// |KOL1|     KOL2|KOL3|
// +----+---------+----+
// | 123|940932123|   6|
// | 940|123940932|   3|
// | 932|940123932|   6|
// +----+---------+----+
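One caveat: indexOfSlice returns a 0-based index (and -1 when the substring is absent), whereas Spark SQL's instr returns a 1-based position (and 0 when absent), so the udf above does not exactly replicate instr's semantics. A minimal plain-Scala sketch of the offset, using one row from the example data:

```scala
// indexOfSlice: 0-based index, -1 if not found (Scala collections API).
// Spark SQL instr: 1-based position, 0 if not found.
val str = "940932123"
val sub = "123"

val zeroBased = str.indexOfSlice(sub) // what the udf produces: 6
val oneBased  = zeroBased + 1         // what instr(str, sub) would report: 7

println(zeroBased) // 6
println(oneBased)  // 7
```

If you want instr's 1-based behaviour, adding 1 inside the udf (and handling the not-found case) closes the gap. As an alternative, passing the whole call as a SQL expression, e.g. expr("instr(KOL2, KOL1)"), may also work, since the SQL parser resolves both arguments as column expressions; note its result is 1-based.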