Scala Spark - counting values in a DataFrame column

Date: 2017-10-29 00:17:22

Tags: scala apache-spark dataframe aggregate window-functions

How can I count the occurrences of a String in a column of a DataFrame, partitioned by id, using Spark?

e.g. count the value "test" in the column "name" of df.

In SQL this would be:

 SELECT
    SUM(CASE WHEN name = 'test' THEN 1 else 0 END) over window AS cnt_test
  FROM
    mytable
 WINDOW window AS (PARTITION BY id)

I have tried using map( v => v match { case "test" => 1 .. })

and so on:

def getCount(df: DataFrame): DataFrame = {
  val dfCnt = df.withColumn("cnt_test",
    count(when(col("name") === lit("test"), 1)))
  dfCnt
}
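
The map + pattern-match idea can be sketched on a plain Scala collection, without Spark (the ids and names below are made up for illustration):

```scala
// Count occurrences of "test" per id over (id, name) pairs,
// mimicking PARTITION BY id with groupBy on a local collection.
val rows = Seq(("a", "joe"), ("b", "test"), ("b", "john"))

val cntTest: Map[String, Int] =
  rows
    .groupBy { case (id, _) => id }
    .map { case (id, group) =>
      id -> group.count { case (_, name) => name == "test" }
    }
// cntTest("a") == 0, cntTest("b") == 1
```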

Is this an expensive operation? What is the best way to check for occurrences of a specific string and then perform an operation on it (sum, max, min, etc)?

Thanks

2 answers:

Answer 0: (score: 7)

You can use groupBy + agg in Spark: when($"name" === "test", 1) converts the name column to 1 where name == 'test' and to null otherwise, and count counts the non-null values:

val df = Seq(("a", "joe"), ("b", "test"), ("b", "john")).toDF("id", "name")
df.groupBy("id").agg(count(when($"name" === "test", 1)).as("cnt_test")).show
+---+--------+
| id|cnt_test|
+---+--------+
|  b|       1|
|  a|       0|
+---+--------+

Or, using sum instead of count:

df.groupBy("id").agg(sum(when($"name" === "test", 1).otherwise(0)).as("cnt_test")).show
+---+--------+
| id|cnt_test|
+---+--------+
|  b|       1|
|  a|       0|
+---+--------+
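
The two aggregates agree because count(when(cond, 1)) counts only the non-null values, while sum(when(cond, 1).otherwise(0)) adds explicit zeros. The equivalence can be checked with plain Scala, modelling when without otherwise as Option (sample names invented for illustration):

```scala
// when($"name" === "test", 1) without otherwise yields null for
// non-matching rows; model that as Option[Int] on a local collection.
val names = Seq("joe", "test", "john", "test")

val viaCount = names.map(n => if (n == "test") Some(1) else None).count(_.isDefined)
val viaSum   = names.map(n => if (n == "test") 1 else 0).sum
// viaCount == 2 and viaSum == 2
```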

Or, similar to your SQL query, with count over a window partitioned by id:

import org.apache.spark.sql.expressions.Window

df.withColumn("cnt_test",
    count(when($"name" === "test", 1)).over(Window.partitionBy($"id"))
  ).show

Answer 1: (score: 0)

If you want a direct translation of your SQL, you can also use a window function in Spark:

def getCount(df: DataFrame): DataFrame = {
  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.{sum, when}
  val spark = df.sparkSession
  import spark.implicits._

  df.withColumn("cnt_test",
      sum(when($"name" === "test", 1).otherwise(0)).over(Window.partitionBy($"id"))
    )
}
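
Unlike groupBy, the window version keeps one output row per input row, each carrying its partition's count. That shape can be sketched in plain Scala without Spark (sample rows invented for illustration):

```scala
// Attach the per-id count of "test" to every row, mimicking
// sum(...).over(Window.partitionBy($"id")) on a local collection.
val rows = Seq(("a", "joe"), ("b", "test"), ("b", "john"))

val perId: Map[String, Int] =
  rows.groupBy(_._1).map { case (id, g) => id -> g.count(_._2 == "test") }

val withCnt = rows.map { case (id, name) => (id, name, perId(id)) }
// withCnt == Seq(("a", "joe", 0), ("b", "test", 1), ("b", "john", 1))
```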