How can I count occurrences of a String in a DataFrame column, partitioned by id, with Spark?
E.g. find the value "test" in the "name" column of df.
In SQL this would be:
SELECT
SUM(CASE WHEN name = 'test' THEN 1 else 0 END) over window AS cnt_test
FROM
mytable
WINDOW window AS (PARTITION BY id)
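For reference, this query can also be run essentially as-is through Spark SQL by registering the DataFrame as a temporary view. A minimal, self-contained sketch (assuming Spark is on the classpath; the sample data here is illustrative):

```scala
import org.apache.spark.sql.SparkSession

object CntTestSql extends App {
  val spark = SparkSession.builder().master("local[*]").appName("cnt_test").getOrCreate()
  import spark.implicits._

  val df = Seq(("a", "joe"), ("b", "test"), ("b", "john")).toDF("id", "name")

  // Register the DataFrame so plain SQL can reference it.
  df.createOrReplaceTempView("mytable")

  // Windowed conditional count, same shape as the SQL above.
  spark.sql("""
    SELECT id, name,
           SUM(CASE WHEN name = 'test' THEN 1 ELSE 0 END)
             OVER (PARTITION BY id) AS cnt_test
    FROM mytable
  """).show()

  spark.stop()
}
```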
I have tried using map( v => match { case "test" -> 1.. }) and the like:
def getCount(df: DataFrame): DataFrame = {
  val dfCnt = df.agg(
    .withColumn("cnt_test",
      count(col("name")==lit('test'))
    )
}
Is this an expensive operation? What is the best way to check for occurrences of a specific string and then perform an operation on them (sum, max, min, etc.)?
Thanks
Answer 0 (score: 7)
You can use groupBy + agg in Spark; here when($"name" === "test", 1) transforms the name column to 1 when name == 'test' and to null otherwise, and count then counts the non-null values:
val df = Seq(("a", "joe"), ("b", "test"), ("b", "john")).toDF("id", "name")
df.groupBy("id").agg(count(when($"name" === "test", 1)).as("cnt_test")).show
+---+--------+
| id|cnt_test|
+---+--------+
| b| 1|
| a| 0|
+---+--------+
Or, matching the CASE WHEN logic of your SQL with sum:
df.groupBy("id").agg(sum(when($"name" === "test", 1).otherwise(0)).as("cnt_test")).show
+---+--------+
| id|cnt_test|
+---+--------+
| b| 1|
| a| 0|
+---+--------+
Or, closer to your SQL query, with a window function:
import org.apache.spark.sql.expressions.Window
df.withColumn("cnt_test",
  sum(when($"name" === "test", 1).otherwise(0)).over(Window.partitionBy($"id"))
).show
Answer 1 (score: 0)
If you want a direct translation of your SQL, you can also use a Window function in Spark:
def getCount(df: DataFrame): DataFrame = {
  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.{sum, when}
  // the $"..." syntax requires spark.implicits._ in scope
  df.withColumn("cnt_test",
    sum(when($"name" === "test", 1).otherwise(0)).over(Window.partitionBy($"id"))
  )
}
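For completeness, a minimal, self-contained sketch of wiring this up (assuming Spark is on the classpath; the sample data mirrors the example above). Unlike the groupBy version, the window version keeps one row per input row, which matches the semantics of the original SQL:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum, when}

object WindowCount extends App {
  val spark = SparkSession.builder().master("local[*]").appName("getCount").getOrCreate()
  import spark.implicits._

  // Windowed conditional count: every row keeps the per-id count of name == "test".
  def getCount(df: DataFrame): DataFrame =
    df.withColumn("cnt_test",
      sum(when(col("name") === "test", 1).otherwise(0)).over(Window.partitionBy(col("id"))))

  val df = Seq(("a", "joe"), ("b", "test"), ("b", "john")).toDF("id", "name")
  getCount(df).show()

  spark.stop()
}
```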