Case-insensitive search in an array-type column of a Spark DataFrame

Date: 2017-07-25 17:00:07

Tags: scala apache-spark

My Spark DataFrame is as follows:


 +----------+-------------------------------------------------+
 |col1      |words                                            |
 +----------+-------------------------------------------------+
 |An        |[An, attractive, ,, thin, low, profile]          |
 |attractive|[An, attractive, ,, thin, low, profile]          |
 |,         |[An, attractive, ,, thin, low, profile]          |
 |thin      |[An, attractive, ,, thin, low, profile]          |
 |rail      |[An, attractive, ,, thin, low, profile]          |
 |profile   |[An, attractive, ,, thin, low, profile]          |
 |Lighter   |[Lighter, than, metal, ,, Level, ,, and, tes]    |
 |than      |[Lighter, than, metal, ,, Level, ,, and, tww]    |
 |steel     |[Lighter, than, metal, ,, Level, ,, and, test]   |
 |,         |[Lighter, than, metal, ,, Level, ,, and, Test]   |
 |Level     |[Lighter, than, metal, ,, Level, ,, and, test]   |
 |,         |[Lighter, than, metal, ,, Level, ,, and, ste]    |
 |and       |[Lighter, than, metal, ,, Level, ,, and, ste]    |
 |Test      |[Lighter, than, metal, ,, Level, ,, and, Ste]    |
 |Renewable |[Renewable, resource]                            |
 |Resource  |[Renewable, resource]                            |
 |No        |[No1, Bal, testme, saves, time, and, money]      |
 +----------+-------------------------------------------------+

I want to filter on the words column above, ignoring case, but my current attempt does not return any data. Please help me solve this.

2 Answers:

Answer 0: (score: 1)

For this you can create a simple udf that converts the array entries to lowercase and filters on the result.

Here is a simple example,

scala> import spark.implicits._
import spark.implicits._

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val df = Seq(("An", List("An", "attractive"," ","", "thin", "low", "profile")), ("Lighter", List("Lighter", "than", "metal"," " ,"", "Level"," " ,"", "and", "tes"))).toDF("col1", "words")
df: org.apache.spark.sql.DataFrame = [col1: string, words: array<string>]

scala> val filterUdf = udf((arr: Seq[String]) => arr.map(_.toLowerCase).contains("level".toLowerCase))
filterUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,BooleanType,Some(List(ArrayType(StringType,true))))

scala> df.filter(filterUdf($"words")).show(false)

+-------+-------------------------------------------------+
|col1   |words                                            |
+-------+-------------------------------------------------+
|Lighter|[Lighter, than, metal,  , , Level,  , , and, tes]|
+-------+-------------------------------------------------+

Hope this helps!
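The core of the udf above is just a case-insensitive membership test on a Seq[String], which can be exercised outside Spark as well. A minimal plain-Scala sketch of that logic (the helper name containsIgnoreCase is illustrative, not part of the answer):

```scala
// Case-insensitive membership test: the same logic the udf applies per row.
// equalsIgnoreCase avoids allocating a lowercased copy of every element.
def containsIgnoreCase(words: Seq[String], target: String): Boolean =
  words.exists(_.equalsIgnoreCase(target))

println(containsIgnoreCase(Seq("Lighter", "than", "metal", "Level"), "level")) // true
println(containsIgnoreCase(Seq("An", "attractive", "thin"), "level"))          // false
```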

Answer 1: (score: 1)

DataSets are easier to work with than DataFrames, so I suggest you convert your dataframe to a dataset, or create the DataSet from the source data in the first place.

Suppose you have a dataset of the following case class

case class data(col1: String, words: Array[String])

For the purpose of illustration, I am creating a temporary dataset as

import sqlContext.implicits._
val ds = Seq(
  data("profile", Array("An", "attractive", "", "", "thin", "low", "profile")),
  data("Lighter", Array("Lighter", "than", "metal", "", "", "Level", "", "", "and", "tes"))
).toDS

which is similar to the dataframe you have

+-------+-----------------------------------------------+
|col1   |words                                          |
+-------+-----------------------------------------------+
|profile|[An, attractive, , , thin, low, profile]       |
|Lighter|[Lighter, than, metal, , , Level, , , and, tes]|
+-------+-----------------------------------------------+

You can perform a filter on the dataset similar to what you would do on the rows of an RDD:

ds.filter(row => row.words.map(element => element.toLowerCase).contains("level"))

The result is

+-------+-----------------------------------------------+
|col1   |words                                          |
+-------+-----------------------------------------------+
|Lighter|[Lighter, than, metal, , , Level, , , and, tes]|
+-------+-----------------------------------------------+

Update

In case you are struggling to convert your dataframe to a dataset, here is one way to achieve it.

Suppose you have a dataframe df as

+---+-------------+--------+---+
|age|maritalStatus|name    |sex|
+---+-------------+--------+---+
|35 |M            |Joanna  |F  |
|25 |S            |Isabelle|F  |
|19 |S            |Andy    |M  |
|70 |M            |Robert  |M  |
+---+-------------+--------+---+

Then you should create a case class to match the schema of the dataframe (the class name here is illustrative)

case class person(age: Int, maritalStatus: String, name: String, sex: String)

Then converting the type with as would do the trick

val ds = df.as[person]

Then you can proceed as described in the first part of this answer.

I hope the answer is helpful.
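As a closing note not from the original answers: on Spark 2.4 and later (which postdates this question), the same case-insensitive test can be written without a udf, using the built-in exists higher-order function in a SQL expression. A sketch, assuming the df with a words array column from the first answer:

```scala
import org.apache.spark.sql.functions.expr

// Keep rows whose `words` array contains "level" in any letter case;
// `exists(array, lambda)` is evaluated natively by Spark, avoiding udf overhead.
val filtered = df.filter(expr("exists(words, w -> lower(w) = 'level')"))
filtered.show(false)
```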