I have the following DataFrame:
name,email,phone,country
------------------------------------------------
[Mike,mike@example.com,+91-9999999999,Italy]
[Alex,alex@example.com,+91-9999999998,France]
[John,john@example.com,+1-1111111111,United States]
[Donald,donald@example.com,+1-2222222222,United States]
[Dan,dan@example.com,+91-9999444999,Poland]
[Scott,scott@example.com,+91-9111999998,Spain]
[Rob,rob@example.com,+91-9114444998,Italy]
It is exposed as the temp table tagged_users:
resultDf.createOrReplaceTempView("tagged_users")
I need to add an additional column, tag, to this DataFrame and assign calculated tags according to different SQL conditions, which are described in the following map (key: tag name, value: condition for the WHERE clause):
val tags = Map(
  "big" -> "country IN (SELECT * FROM big_countries)",
  "medium" -> "country IN (SELECT * FROM medium_countries)",
  //2000 other different tags and conditions
  "sometag" -> "name = 'Donald' AND email = 'donald@example.com' AND phone = '+1-2222222222'"
)
I have the following DataFrames (as data dictionaries) so that I can use them in the SQL queries:
Seq("Italy", "France", "United States", "Spain").toDF("country").createOrReplaceTempView("big_countries")
Seq("Poland", "Hungary", "Spain").toDF("country").createOrReplaceTempView("medium_countries")
I'd like to test each row in my tagged_users table and assign it the appropriate tags. To achieve this, I tried to implement the following logic:
tags.foreach {
  case (tag, tagCondition) => {
    resultDf = spark.sql(buildTagQuery(tag, tagCondition, "tagged_users"))
      .withColumn("tag", lit(tag).cast(StringType))
  }
}
def buildTagQuery(tag: String, tagCondition: String, table: String): String = {
  f"SELECT * FROM $table WHERE $tagCondition"
}
But right now I don't know how to accumulate the tags without overwriting them. At the moment I get the following DataFrame as the result:
name,email,phone,country,tag
Dan,dan@example.com,+91-9999444999,Poland,medium
Scott,scott@example.com,+91-9111999998,Spain,medium
but I need something like this:
name,email,phone,country,tag
Mike,mike@example.com,+91-9999999999,Italy,big
Alex,alex@example.com,+91-9999999998,France,big
John,john@example.com,+1-1111111111,United States,big
Donald,donald@example.com,+1-2222222222,United States,(big|sometag)
Dan,dan@example.com,+91-9999444999,Poland,medium
Scott,scott@example.com,+91-9111999998,Spain,(big|medium)
Rob,rob@example.com,+91-9114444998,Italy,big
Please note that Donald should have 2 tags, (big|sometag), and Scott should have 2 tags, (big|medium).
Please show how to implement this.
UPDATED
The following code fails with an exception:
val spark = SparkSession
.builder()
.appName("Java Spark SQL basic example")
.config("spark.master", "local")
.getOrCreate();
import spark.implicits._
import spark.sql
Seq("Italy", "France", "United States", "Spain").toDF("country").createOrReplaceTempView("big_countries")
Seq("Poland", "Hungary", "Spain").toDF("country").createOrReplaceTempView("medium_countries")
val df = Seq(
("Mike", "mike@example.com", "+91-9999999999", "Italy"),
("Alex", "alex@example.com", "+91-9999999998", "France"),
("John", "john@example.com", "+1-1111111111", "United States"),
("Donald", "donald@example.com", "+1-2222222222", "United States"),
("Dan", "dan@example.com", "+91-9999444999", "Poland"),
("Scott", "scott@example.com", "+91-9111999998", "Spain"),
("Rob", "rob@example.com", "+91-9114444998", "Italy")).toDF("name", "email", "phone", "country")
df.collect.foreach(println)
df.createOrReplaceTempView("tagged_users")
val tags = Map(
"big" -> "country IN (SELECT * FROM big_countries)",
"medium" -> "country IN (SELECT * FROM medium_countries)",
"sometag" -> "name = 'Donald' AND email = 'donald@example.com' AND phone = '+1-2222222222'")
val sep_tag = tags.map((x) => { s"when array_contains(" + x._1 + ", country) then '" + x._1 + "' " }).mkString
val combine_sel_tag1 = tags.map((x) => { s" array_contains(" + x._1 + ",country) " }).mkString(" and ")
val combine_sel_tag2 = tags.map((x) => x._1).mkString(" '(", "|", ")' ")
val combine_sel_all = " case when " + combine_sel_tag1 + " then " + combine_sel_tag2 + sep_tag + " end as tags "
val crosqry = tags.map((x) => { s" cross join ( select collect_list(country) as " + x._1 + " from " + x._1 + "_countries) " + x._1 + " " }).mkString
val qry = " select name,email,phone,country, " + combine_sel_all + " from tagged_users " + crosqry
spark.sql(qry).show
spark.stop()
Answer 0: (score: 1)
If you need to aggregate the results rather than just execute each query, use map instead of foreach and then union the results:
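import org.apache.spark.sql.functions.lit

// One DataFrame per tag: the rows matching the condition, labelled with that tag
val o = tags.map {
  case (tag, tagCondition) =>
    spark.sql(buildTagQuery(tag, tagCondition, "tagged_users"))
      .withColumn("tag", lit(tag))
}
// reduce instead of foldLeft(o.head), which would union the first DataFrame twice
val tagged = o.reduce(_ union _)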
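Note that a union still leaves one row per (user, tag) pair, so the tags are not yet merged into a single value such as (big|medium). One possible way to finish the job, sketched here under assumptions, is to generate a single statement that unions the per-tag selects and then groups by the original columns. AccumulatedTagQuery and its build method are hypothetical names, not part of Spark; the sketch assumes that the rows of the table are unique, that tag names contain no quote characters, and that rows matching no tag may be dropped from the result.

```scala
// Sketch only: AccumulatedTagQuery is a hypothetical helper, not a Spark API.
object AccumulatedTagQuery {
  /** Builds one SQL statement that selects the matching rows for every tag,
    * unions them, and merges the tags of identical rows into a
    * '|'-separated string. */
  def build(tags: Map[String, String], table: String, columns: Seq[String]): String = {
    require(tags.nonEmpty, "at least one tag is needed")
    // Assumption: tag names are plain identifiers, so no SQL escaping is attempted.
    require(tags.keys.forall(!_.contains("'")), "tag names must not contain quotes")

    val cols = columns.mkString(", ")
    // One SELECT per tag, reusing the WHERE-clause condition unchanged.
    val perTag = tags.toSeq.sortBy(_._1).map { case (tag, condition) =>
      s"SELECT $cols, '$tag' AS tag FROM $table WHERE $condition"
    }
    Seq(
      // sort_array makes the order of the merged tags deterministic.
      s"SELECT $cols, concat_ws('|', sort_array(collect_list(tag))) AS tag",
      s"FROM (${perTag.mkString(" UNION ALL ")}) t",
      s"GROUP BY $cols"
    ).mkString("\n")
  }
}
```

With the tables from the question registered, the generated text could be passed to spark.sql, for example spark.sql(AccumulatedTagQuery.build(tags, "tagged_users", Seq("name", "email", "phone", "country"))). With about 2000 tags the statement becomes very large, so this is an illustration of the idea rather than a tuned solution.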
Answer 1: (score: 1)
Check out this DataFrame solution:
scala> val df = Seq(("Mike","mike@example.com","+91-9999999999","Italy"),
| ("Alex","alex@example.com","+91-9999999998","France"),
| ("John","john@example.com","+1-1111111111","United States"),
| ("Donald","donald@example.com","+1-2222222222","United States"),
| ("Dan","dan@example.com","+91-9999444999","Poland"),
| ("Scott","scott@example.com","+91-9111999998","Spain"),
| ("Rob","rob@example.com","+91-9114444998","Italy")
| ).toDF("name","email","phone","country")
df: org.apache.spark.sql.DataFrame = [name: string, email: string ... 2 more fields]
scala> val dfbc=Seq("Italy", "France", "United States", "Spain").toDF("country")
dfbc: org.apache.spark.sql.DataFrame = [country: string]
scala> val dfmc=Seq("Poland", "Hungary", "Spain").toDF("country")
dfmc: org.apache.spark.sql.DataFrame = [country: string]
scala> val dfbc2=dfbc.agg(collect_list('country).as("bcountry"))
dfbc2: org.apache.spark.sql.DataFrame = [bcountry: array<string>]
scala> val dfmc2=dfmc.agg(collect_list('country).as("mcountry"))
dfmc2: org.apache.spark.sql.DataFrame = [mcountry: array<string>]
scala> val df2=df.crossJoin(dfbc2).crossJoin(dfmc2)
df2: org.apache.spark.sql.DataFrame = [name: string, email: string ... 4 more fields]
scala> df2.selectExpr("*","case when array_contains(bcountry,country) and array_contains(mcountry,country) then '(big|medium)' when array_contains(bcountry,country) then 'big' when array_contains(mcountry,country) then 'medium' else 'none' end as `tags`").select("name","email","phone","country","tags").show(false)
+------+------------------+--------------+-------------+------------+
|name |email |phone |country |tags |
+------+------------------+--------------+-------------+------------+
|Mike |mike@example.com |+91-9999999999|Italy |big |
|Alex |alex@example.com |+91-9999999998|France |big |
|John |john@example.com |+1-1111111111 |United States|big |
|Donald|donald@example.com|+1-2222222222 |United States|big |
|Dan |dan@example.com |+91-9999444999|Poland |medium |
|Scott |scott@example.com |+91-9111999998|Spain |(big|medium)|
|Rob |rob@example.com |+91-9114444998|Italy |big |
+------+------------------+--------------+-------------+------------+
scala>
SQL approach
scala> Seq(("Mike","mike@example.com","+91-9999999999","Italy"),
| ("Alex","alex@example.com","+91-9999999998","France"),
| ("John","john@example.com","+1-1111111111","United States"),
| ("Donald","donald@example.com","+1-2222222222","United States"),
| ("Dan","dan@example.com","+91-9999444999","Poland"),
| ("Scott","scott@example.com","+91-9111999998","Spain"),
| ("Rob","rob@example.com","+91-9114444998","Italy")
| ).toDF("name","email","phone","country").createOrReplaceTempView("tagged_users")
scala> Seq("Italy", "France", "United States", "Spain").toDF("country").createOrReplaceTempView("big_countries")
scala> Seq("Poland", "Hungary", "Spain").toDF("country").createOrReplaceTempView("medium_countries")
scala> spark.sql(""" select name,email,phone,country,case when array_contains(bc,country) and array_contains(mc,country) then '(big|medium)' when array_contains(bc,country) then 'big' when array_contains(mc,country) then 'medium' else 'none' end as tags from tagged_users cross join ( select collect_list(country) as bc from big_countries ) b cross join ( select collect_list(country) as mc from medium_countries ) c """).show(false)
+------+------------------+--------------+-------------+------------+
|name |email |phone |country |tags |
+------+------------------+--------------+-------------+------------+
|Mike |mike@example.com |+91-9999999999|Italy |big |
|Alex |alex@example.com |+91-9999999998|France |big |
|John |john@example.com |+1-1111111111 |United States|big |
|Donald|donald@example.com|+1-2222222222 |United States|big |
|Dan |dan@example.com |+91-9999444999|Poland |medium |
|Scott |scott@example.com |+91-9111999998|Spain |(big|medium)|
|Rob |rob@example.com |+91-9114444998|Italy |big |
+------+------------------+--------------+-------------+------------+
scala>
Iterating through the tags
scala> val tags = Map(
| "big" -> "country IN (SELECT * FROM big_countries)",
| "medium" -> "country IN (SELECT * FROM medium_countries)")
tags: scala.collection.immutable.Map[String,String] = Map(big -> country IN (SELECT * FROM big_countries), medium -> country IN (SELECT * FROM medium_countries))
scala> val sep_tag = tags.map( (x) => { s"when array_contains("+x._1+", country) then '" + x._1 + "' " } ).mkString
sep_tag: String = "when array_contains(big, country) then 'big' when array_contains(medium, country) then 'medium' "
scala> val combine_sel_tag1 = tags.map( (x) => { s" array_contains("+x._1+",country) " } ).mkString(" and ")
combine_sel_tag1: String = " array_contains(big,country) and array_contains(medium,country) "
scala> val combine_sel_tag2 = tags.map( (x) => x._1 ).mkString(" '(","|", ")' ")
combine_sel_tag2: String = " '(big|medium)' "
scala> val combine_sel_all = " case when " + combine_sel_tag1 + " then " + combine_sel_tag2 + sep_tag + " end as tags "
combine_sel_all: String = " case when array_contains(big,country) and array_contains(medium,country) then '(big|medium)' when array_contains(big, country) then 'big' when array_contains(medium, country) then 'medium' end as tags "
scala> val crosqry = tags.map( (x) => { s" cross join ( select collect_list(country) as "+x._1+" from "+x._1+"_countries) "+ x._1 + " " } ).mkString
crosqry: String = " cross join ( select collect_list(country) as big from big_countries) big cross join ( select collect_list(country) as medium from medium_countries) medium "
scala> val qry = " select name,email,phone,country, " + combine_sel_all + " from tagged_users " + crosqry
qry: String = " select name,email,phone,country, case when array_contains(big,country) and array_contains(medium,country) then '(big|medium)' when array_contains(big, country) then 'big' when array_contains(medium, country) then 'medium' end as tags from tagged_users cross join ( select collect_list(country) as big from big_countries) big cross join ( select collect_list(country) as medium from medium_countries) medium "
scala> spark.sql(qry).show
+------+------------------+--------------+-------------+------------+
| name| email| phone| country| tags|
+------+------------------+--------------+-------------+------------+
| Mike| mike@example.com|+91-9999999999| Italy| big|
| Alex| alex@example.com|+91-9999999998| France| big|
| John| john@example.com| +1-1111111111|United States| big|
|Donald|donald@example.com| +1-2222222222|United States| big|
| Dan| dan@example.com|+91-9999444999| Poland| medium|
| Scott| scott@example.com|+91-9111999998| Spain|(big|medium)|
| Rob| rob@example.com|+91-9114444998| Italy| big|
+------+------------------+--------------+-------------+------------+
scala>
UPDATE2:
scala> Seq(("Mike","mike@example.com","+91-9999999999","Italy"),
| ("Alex","alex@example.com","+91-9999999998","France"),
| ("John","john@example.com","+1-1111111111","United States"),
| ("Donald","donald@example.com","+1-2222222222","United States"),
| ("Dan","dan@example.com","+91-9999444999","Poland"),
| ("Scott","scott@example.com","+91-9111999998","Spain"),
| ("Rob","rob@example.com","+91-9114444998","Italy")
| ).toDF("name","email","phone","country").createOrReplaceTempView("tagged_users")
scala> Seq("Italy", "France", "United States", "Spain").toDF("country").createOrReplaceTempView("big_countries")
scala> Seq("Poland", "Hungary", "Spain").toDF("country").createOrReplaceTempView("medium_countries")
scala> val tags = Map(
| "big" -> "country IN (SELECT * FROM big_countries)",
| "medium" -> "country IN (SELECT * FROM medium_countries)",
| "sometag" -> "name = 'Donald' AND email = 'donald@example.com' AND phone = '+1-2222222222'")
tags: scala.collection.immutable.Map[String,String] = Map(big -> country IN (SELECT * FROM big_countries), medium -> country IN (SELECT * FROM medium_countries), sometag -> name = 'Donald' AND email = 'donald@example.com' AND phone = '+1-2222222222')
scala> val sql_tags = tags.map( x => { val p = x._2.trim.toUpperCase.split(" ");
| val qry = if(p.contains("IN") && p.contains("FROM"))
| s" case when array_contains((select collect_list("+p.head +") from " + p.last.replaceAll("[)]","")+ " ), " +p.head + " ) then '" + x._1 + " ' else '' end " + x._1 + " "
| else
| " case when " + x._2 + " then '" + x._1 + " ' else '' end " + x._1 + " ";
| qry } ).mkString(",")
sql_tags: String = " case when array_contains((select collect_list(COUNTRY) from BIG_COUNTRIES ), COUNTRY ) then 'big ' else '' end big , case when array_contains((select collect_list(COUNTRY) from MEDIUM_COUNTRIES ), COUNTRY ) then 'medium ' else '' end medium , case when name = 'Donald' AND email = 'donald@example.com' AND phone = '+1-2222222222' then 'sometag ' else '' end sometag "
scala> val outer_query = tags.map( x=> x._1).mkString(" regexp_replace(trim(concat(", ",", " )),' ','|') tags ")
outer_query: String = " regexp_replace(trim(concat(big,medium,sometag )),' ','|') tags "
scala> spark.sql(" select name,email, country, " + outer_query + " from ( select name,email, country ," + sql_tags + " from tagged_users ) " ).show
+------+------------------+-------------+-----------+
| name| email| country| tags|
+------+------------------+-------------+-----------+
| Mike| mike@example.com| Italy| big|
| Alex| alex@example.com| France| big|
| John| john@example.com|United States| big|
|Donald|donald@example.com|United States|big|sometag|
| Dan| dan@example.com| Poland| medium|
| Scott| scott@example.com| Spain| big|medium|
| Rob| rob@example.com| Italy| big|
+------+------------------+-------------+-----------+
scala>
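The outer query in UPDATE2 relies on a small string trick: each per-tag column holds either the tag name followed by one space or an empty string, so concatenating, trimming and replacing the remaining spaces with '|' yields the combined tag. The plain-Scala sketch below mirrors that expression outside Spark to make the steps visible; TagJoinDemo and joinTags are hypothetical names used only for this illustration.

```scala
// Plain-Scala mirror of
//   regexp_replace(trim(concat(big, medium, sometag)), ' ', '|')
// TagJoinDemo and joinTags are illustrative names, not Spark functions.
object TagJoinDemo {
  def joinTags(perTagColumns: Seq[String]): String =
    perTagColumns
      .mkString              // concat(...): "big " + "" + "sometag " => "big sometag "
      .trim                  // trim(...): drops the trailing space
      .replaceAll(" ", "|")  // the remaining spaces separate tags

  val donald: String = joinTags(Seq("big ", "", "sometag ")) // "big|sometag"
  val dan: String    = joinTags(Seq("", "medium ", ""))       // "medium"
  val untagged: String = joinTags(Seq("", "", ""))            // ""
}
```

One consequence of this design is that a tag name containing a space would be split by the final replacement, so the trick presumably only suits tag names without spaces.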
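Answer 2: (score: 0)
I would define multiple tag tables, each with the columns value and tag.
Your tag definitions would then be a collection, say a Seq[(String, String)], where the first tuple element is the column on which the tag is calculated.
Let's say both tuple elements are Strings.
Then iterate over this list, left joining each table on the relevant column with the associated column.
After joining each table, simply select your tag column as its current value plus the joined column, if that is not null.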
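The left-join idea can be sketched as a generated SQL statement. This is only one reading of the description, under stated assumptions: the second tuple element is taken to be the name of a tag table, each tag table has the columns value and tag with unique values, and the joined tag columns are concatenated once at the end with concat_ws (which skips nulls) instead of after every join. LeftJoinTagQuery and the table names big_tags and medium_tags are hypothetical.

```scala
// Sketch only: LeftJoinTagQuery is a hypothetical helper, not a Spark API.
object LeftJoinTagQuery {
  /** @param table     name of the table to tag
    * @param columns   columns of that table to keep in the result
    * @param tagTables pairs of (column to join on, name of a tag table that
    *                  is assumed to have the columns `value` and `tag`)
    */
  def build(table: String, columns: Seq[String], tagTables: Seq[(String, String)]): String = {
    require(tagTables.nonEmpty, "at least one tag table is needed")

    // One LEFT JOIN per tag table, each under its own alias t0, t1, ...
    val joins = tagTables.zipWithIndex.map { case ((column, tagTable), i) =>
      s"LEFT JOIN $tagTable t$i ON u.$column = t$i.value"
    }
    // A row without a match gets NULL from that join; concat_ws ignores NULLs.
    val tagCols = tagTables.indices.map(i => s"t$i.tag").mkString(", ")
    val select = columns.map(c => s"u.$c").mkString(", ")

    (Seq(s"SELECT $select, concat_ws('|', $tagCols) AS tag", s"FROM $table u") ++ joins)
      .mkString("\n")
  }
}
```

For instance, LeftJoinTagQuery.build("tagged_users", Seq("name", "country"), Seq("country" -> "big_tags", "country" -> "medium_tags")) produces a statement with two left joins on country. Rows that match no tag table keep an empty tag instead of being dropped.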