Passing Array[Seq[String]] to a UDF in Spark Scala

Time: 2017-07-14 09:09:19

Tags: scala apache-spark user-defined-functions

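I am new to UDFs in Spark. I have also read the answer here.

Problem statement: I am trying to find pattern matches in a column of a DataFrame.

Example DataFrame: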

val df = Seq((1, Some("z")), (2, Some("abs,abc,dfg")),
             (3,Some("a,b,c,d,e,f,abs,abc,dfg"))).toDF("id", "text")

df.show()

+---+--------------------+
| id|                text|
+---+--------------------+
|  1|                   z|
|  2|         abs,abc,dfg|
|  3|a,b,c,d,e,f,abs,a...|
+---+--------------------+


df.filter($"text".contains("abs,abc,dfg")).count()
// returns 2, because "abs,abc,dfg" occurs in the 2nd and 3rd rows

Now I want to perform this pattern match for every row of the text column and add a new column named count.

Expected result:

+---+--------------------+-----+
| id|                text|count|
+---+--------------------+-----+
|  1|                   z|    1|
|  2|         abs,abc,dfg|    2|
|  3|a,b,c,d,e,f,abs,a...|    1|
+---+--------------------+-----+

I tried to define a UDF that takes the text column as an Array[Seq[String]], but I could not get it to do what I intended.

What I have tried so far:

val txt = df.select("text").collect.map(_.toSeq.map(_.toString)) //convert column to Array[Seq[String]
val valsum = udf((txt:Array[Seq[String],pattern:String)=> {txt.count(_ == pattern) } )
df.withColumn("newCol", valsum( lit(txt) ,df(text)) )).show()

Any help would be appreciated.

1 answer:

Answer 0 (score: 1)

You need to know all the elements of the text column, which you can get by using collect_list over a window that groups all the rows of the DataFrame into one. Then, for each row, check the value of the text column against the elements of the collected array and count the matches, as in the following code.

import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

val df = Seq((1, Some("z")), (2, Some("abs,abc,dfg")),
             (3, Some("a,b,c,d,e,f,abs,abc,dfg"))).toDF("id", "text")

// Count how many of the collected texts contain this row's text as a substring.
// An array column reaches a Scala UDF as a Seq, so Seq[String] is used here.
val valsum = udf((txt: String, array: Seq[String]) => array.count(_.contains(txt)))

df.withColumn("grouping", lit("g"))   // constant key, so every row falls into one window partition
  .withColumn("array", collect_list("text").over(Window.partitionBy("grouping")))
  .withColumn("count", valsum($"text", $"array"))
  .drop("grouping", "array")
  .show(false)

You should get the following output:

+---+-----------------------+-----+
|id |text                   |count|
+---+-----------------------+-----+
|1  |z                      |1    |
|2  |abs,abc,dfg            |2    |
|3  |a,b,c,d,e,f,abs,abc,dfg|1    |
+---+-----------------------+-----+
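To make the counting rule itself easier to check, here is a minimal plain-Scala sketch (no Spark involved) of what the UDF computes for each row: the number of texts in the whole column that contain the row's own text as a substring. The names ContainsCount and containsCounts are made up for this illustration and are not part of any Spark API.

```scala
object ContainsCount {
  // For each text, count how many texts in the whole collection contain it
  // as a substring (each text always matches itself at least once).
  def containsCounts(texts: Seq[String]): Seq[Int] =
    texts.map(t => texts.count(_.contains(t)))

  def main(args: Array[String]): Unit = {
    val texts = Seq("z", "abs,abc,dfg", "a,b,c,d,e,f,abs,abc,dfg")
    println(containsCounts(texts)) // prints List(1, 2, 1)
  }
}
```

Note that contains is a plain substring test, so a row whose text is just "a" would also be counted against "abs,abc,dfg". If only whole comma-separated tokens should match, the comparison would need to split on commas first.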

I hope this helps.