My Spark DataFrame is as follows:
I want to filter the data in this column, case-insensitively. Currently I am doing it like this:
+----------+-------------------------------------------------+
|col1 |words |
+----------+-------------------------------------------------+
|An |[An, attractive, ,, thin, low, profile] |
|attractive|[An, attractive, ,, thin, low, profile] |
|, |[An, attractive, ,, thin, low, profile] |
|thin |[An, attractive, ,, thin, low, profile] |
|rail |[An, attractive, ,, thin, low, profile] |
|profile |[An, attractive, ,, thin, low, profile] |
|Lighter |[Lighter, than, metal, ,, Level, ,, and, tes] |
|than |[Lighter, than, metal, ,, Level, ,, and, tww] |
|steel |[Lighter, than, metal, ,, Level, ,, and, test] |
|, |[Lighter, than, metal, ,, Level, ,, and, Test] |
|Level |[Lighter, than, metal, ,, Level, ,, and, test] |
|, |[Lighter, than, metal, ,, Level, ,, and, ste] |
|and |[Lighter, than, metal, ,, Level, ,, and, ste] |
|Test |[Lighter, than, metal, ,, Level, ,, and, Ste] |
|Renewable |[Renewable, resource] |
|Resource |[Renewable, resource] |
|No |[No1, Bal, testme, saves, time, and, money] |
+----------+-------------------------------------------------+
But it does not display any data. Please help me resolve this.
Answer 0 (score: 1)
For this you can create a simple UDF that lowercases the array elements and filters on them.
Here is a simple example:
scala> import spark.implicits._
import spark.implicits._
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val df = Seq(("An", List("An", "attractive"," ","", "thin", "low", "profile")), ("Lighter", List("Lighter", "than", "metal"," " ,"", "Level"," " ,"", "and", "tes"))).toDF("col1", "words")
df: org.apache.spark.sql.DataFrame = [col1: string, words: array<string>]
scala> val filterUdf = udf((arr: Seq[String]) => arr.map(_.toLowerCase).contains("level".toLowerCase))
filterUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,BooleanType,Some(List(ArrayType(StringType,true))))
scala> df.filter(filterUdf($"words")).show(false)
+-------+-----------------------------------------------+
|col1   |words                                          |
+-------+-----------------------------------------------+
|Lighter|[Lighter, than, metal, , , Level, , , and, tes]|
+-------+-----------------------------------------------+
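The core of that UDF is plain Scala and can be checked without a Spark session. A minimal sketch of the same logic (the helper name containsIgnoreCase is mine, used only for illustration):

```scala
// Case-insensitive membership test: the same logic the UDF wraps.
// containsIgnoreCase is a hypothetical helper name, not part of the answer.
def containsIgnoreCase(words: Seq[String], target: String): Boolean =
  words.map(_.toLowerCase).contains(target.toLowerCase)

val row = Seq("Lighter", "than", "metal", "Level", "and", "tes")
println(containsIgnoreCase(row, "level")) // "Level" matches once lowercased
println(containsIgnoreCase(row, "steel")) // not present at all
```

Lowercasing both the array elements and the search term keeps the comparison symmetric, so mixed-case data like "Level" and queries like "LEVEL" both match.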
Hope this helps!
Answer 1 (score: 1)
DataSets are easier to work with than DataFrames, so I suggest you either convert your dataframe to a dataset or create the DataSet directly from the source data. Suppose you have a case class such as:
case class data(col1: String, words: Array[String])
For illustration, I am creating a temporary dataset as:
import sqlContext.implicits._
val ds = Seq(
data("profile", Array("An", "attractive", "", "", "thin", "low", "profile")),
data("Lighter", Array("Lighter", "than", "metal", "", "", "Level", "", "", "and", "tes"))
).toDS
which is similar to the dataframe you have:
+-------+-----------------------------------------------+
|col1 |words |
+-------+-----------------------------------------------+
|profile|[An, attractive, , , thin, low, profile] |
|Lighter|[Lighter, than, metal, , , Level, , , and, tes]|
+-------+-----------------------------------------------+
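Since the rows are just case-class instances, the predicate can be previewed on an ordinary Scala collection first. A stand-in sketch (plain Scala, not Spark; data mirrors the case class above):

```scala
// Plain-collection stand-in for the dataset above. This is ordinary
// Scala, not Spark; `data` mirrors the answer's case class.
case class data(col1: String, words: Array[String])

val rows = Seq(
  data("profile", Array("An", "attractive", "thin", "low", "profile")),
  data("Lighter", Array("Lighter", "than", "metal", "Level", "and", "tes"))
)

// Same predicate the dataset filter uses: lowercase, then test membership.
val matched = rows.filter(row => row.words.map(_.toLowerCase).contains("level"))
println(matched.map(_.col1)) // List(Lighter)
```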
You can perform RDD-like operations on the rows of the dataset, and filter on Level as:

ds.filter(row => row.words.map(element => element.toLowerCase).contains("level"))

The result is:
+-------+-----------------------------------------------+
|col1   |words                                          |
+-------+-----------------------------------------------+
|Lighter|[Lighter, than, metal, , , Level, , , and, tes]|
+-------+-----------------------------------------------+

Update

While you are struggling to convert your dataframe to a dataset, here is one way to achieve it.
假设您有dataset
(dataframe
)
df
+---+-------------+--------+---+
|age|maritalStatus|name    |sex|
+---+-------------+--------+---+
|35 |M            |Joanna  |F  |
|25 |S            |Isabelle|F  |
|19 |S            |Andy    |M  |
|70 |M            |Robert  |M  |
+---+-------------+--------+---+

Then you should create a case class to match the schema of df.
Then just changing the alias would do the trick.
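A sketch of what that case class and alias change could look like, assuming the schema shown above (the class name Person and the df.as[...] call are my reconstruction, not the original answer's code):

```scala
// Hypothetical case class matching the df schema shown above.
case class Person(age: Int, maritalStatus: String, name: String, sex: String)

// With Spark, the alias change would be (not runnable without a session):
//   import spark.implicits._
//   val ds = df.as[Person]

// Modelling one row locally to check that the field types line up.
val joanna = Person(35, "M", "Joanna", "F")
println(joanna.name) // Joanna
```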
Then you can proceed as described in the first part of this answer.
I hope the answer is helpful.