My DataFrame looks like this:
id value
1 I am a boy
1 I am a men
1 I am afather
2 I am a girl
2 I am awomen
2 I am a mother
I have two lists, as follows:
val male = List("boy", "men", "father")
val female = List("girl", "women", "mother")
I want to search the value column for a partial match against any of the strings in the lists, and create a DataFrame like the one below:
id value gender
1 I am a boy male
1 I am a men male
1 I am a father male
2 I am a girl female
2 I am a women female
2 I am a mother female
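I am programming in Scala. I only want to check for substrings in the column. I also can't split the values in the column, because they are not consistently separated by spaces, although the strings from the lists do appear in them.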
Answer 0 (score: 0)
Using the RDD approach.
scala> val df = Seq((1,"I am a boy"),
| (1,"I am a men"),
| (1,"I am a father"),
| (2,"I am a girl"),
| (2,"I am a women"),
| (2,"I am a mother")).toDF("id", "value")
df: org.apache.spark.sql.DataFrame = [id: int, value: string]
scala> val male = List("boy", "men", "father")
male: List[String] = List(boy, men, father)
scala> val female = List("girl", "women", "mother")
female: List[String] = List(girl, women, mother)
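scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> import org.apache.spark.sql.types.{StringType, StructField}
import org.apache.spark.sql.types.{StringType, StructField}
scala> val rdd2 = df.rdd.map( x => { val p = if(male.intersect(x(1).toString.split(" ")).length > 0) "male" else if (female.intersect(x(1).toString.split(" ")).length > 0) "female" else "none" ; Row(x(0),x(1),p) } )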
rdd2: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[26] at map at <console>:41
scala> val schema = df.schema.add(StructField("gender",StringType))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(id,IntegerType,false), StructField(value,StringType,true), StructField(gender,StringType,true))
scala> spark.createDataFrame(rdd2,schema).show
+---+-------------+------+
| id| value|gender|
+---+-------------+------+
| 1| I am a boy| male|
| 1| I am a men| male|
| 1|I am a father| male|
| 2| I am a girl|female|
| 2| I am a women|female|
| 2|I am a mother|female|
+---+-------------+------+
scala>
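Note that the transcript above matches whole space-separated words, so a value such as "I am afather" from the question would come out as "none". The question asks for a substring check, which could be sketched with the DataFrame API roughly as below. This is only a sketch under a few assumptions: the object name `GenderTagging` and the helpers `containsAny` and `tag` are made up for illustration, and a local `SparkSession` stands in for whatever session you already have. Because "men" is a substring of "women", the female list is checked first; with other keyword lists you may need a different ordering or a stricter rule.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

// Hypothetical helper object, not part of the original answer.
object GenderTagging {
  val male   = List("boy", "men", "father")
  val female = List("girl", "women", "mother")

  // Builds a boolean column that is true when `c` contains at least one
  // keyword as a substring. An empty keyword list yields a constant false.
  def containsAny(c: Column, keywords: Seq[String]): Column =
    keywords.foldLeft(lit(false))((acc, k) => acc || c.contains(k))

  // Adds a "gender" column. The female list is tested first because
  // "men" is a substring of "women".
  def tag(df: DataFrame): DataFrame =
    df.withColumn(
      "gender",
      when(containsAny(col("value"), female), "female")
        .when(containsAny(col("value"), male), "male")
        .otherwise("none")
    )

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for demonstration purposes only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("gender-tagging")
      .getOrCreate()
    import spark.implicits._

    // Same data as the question, including the values with missing spaces.
    val df = Seq(
      (1, "I am a boy"), (1, "I am a men"), (1, "I am afather"),
      (2, "I am a girl"), (2, "I am awomen"), (2, "I am a mother")
    ).toDF("id", "value")

    tag(df).show(false)
    spark.stop()
  }
}
```

This keeps the work in built-in column expressions instead of going through an RDD and rebuilding the schema. With the question's data, "I am afather" should be tagged male and "I am awomen" female, since the check no longer depends on spaces.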