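Spark: numbering rows within a group

Time: 2017-02-02 21:24:11

Tags: apache-spark

OK, I may be way off here. But I'm having a hard time learning the basics of Spark by mining a fairly small dataset of football matches (from http://www.football-data.co.uk/englandm.php).

Things I've already done:

  • Read all files for all matches in the English leagues.
  • 'Converted' the csv rows to a case class called Match.

Code: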

import java.sql.Date // assumed: java.sql.Date is the date type Spark's built-in encoders support
import org.apache.spark.sql.SparkSession

case class Match(
              startTime: Date,
              homeTeam: String,
              awayTeam: String,
              homeGoals: Int,
              awayGoals: Int,
              league: String,
              season: String,
              round: Int = -1
)

object Parser {
    def main(args: Array[String]): Unit = {

        val spark = SparkSession.builder()
            .appName("test")
            .getOrCreate()

        import spark.implicits._

        // createMatch (not shown) builds a Match from the split csv fields.
        val data = spark.read
            .textFile("data/football-data.co.uk/1516/E*.csv")
            .filter(s => !s.startsWith("Season,Div,Date")) // remove headers.
            .map(s => s.split(","))
            .map(a => createMatch(a))
    }
}

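Now I would like to number every match within each league and season.

I'm drawing a blank here. I've tried partitioning and grouping without any luck, and I badly need some pointers.

Am I wrong to even attempt this in Spark, given that I need to keep state while iterating over league + season?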

2 answers:

Answer 0: (score: 1)

Generating row numbers over a partition/window is best done with the ROW_NUMBER function provided by spark-sql.

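data.createOrReplaceTempView("temp_table")

// List whichever other columns you need before ROW_NUMBER().
val newDF = spark.sql(
"""
SELECT
  startTime, homeTeam, awayTeam, homeGoals, awayGoals, league, season,
  ROW_NUMBER() OVER (PARTITION BY league, season ORDER BY startTime) AS slno
FROM temp_table
""")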

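The same numbering can also be written with the DataFrame API instead of a SQL string. The following is only a minimal sketch: the local session, the sample rows and the names `matches`, `byLeagueAndSeason` and `numbered` are made up for illustration, and dates are kept as ISO strings so that they sort chronologically.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Local session for illustration only; in the question's program `spark` already exists.
val spark = SparkSession.builder()
  .appName("row-number-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample rows standing in for the Match dataset.
val matches = Seq(
  ("2015-08-09", "Team C", "Team D", "E0", "1516"),
  ("2015-08-08", "Team A", "Team B", "E0", "1516"),
  ("2015-08-07", "Team E", "Team F", "E1", "1516")
).toDF("startTime", "homeTeam", "awayTeam", "league", "season")

// One window per (league, season), ordered by kick-off date.
val byLeagueAndSeason = Window.partitionBy("league", "season").orderBy("startTime")

// row_number() restarts at 1 inside every window partition.
val numbered = matches.withColumn("slno", row_number().over(byLeagueAndSeason))

numbered.orderBy("league", "season", "slno").show()
```

Because the window is partitioned by league and season, no explicit state has to be carried between rows; Spark handles the per-group counter.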
Answer 1: (score: 0)

With help from rouge-one's answer, I ended up doing this:

import org.apache.spark.sql.types._

val schema =
  StructType(
    Array(
      StructField("Season",          StringType, false),
      StructField("Div",          StringType, false),
      StructField("Date",         DateType, false),
      StructField("HomeTeam",     StringType, false),
      StructField("AwayTeam",     StringType, false),
      StructField("FTHG",     IntegerType, false),
      StructField("FTAG",     IntegerType, false),
      StructField("FTR",     StringType, false),
      StructField("HTHG",     IntegerType, false),
      StructField("HTAG",     IntegerType, false),
      StructField("HTR",     StringType, false),
      StructField("Referee",     StringType, false),
      StructField("HS",     IntegerType, false),
      StructField("AS",     IntegerType, false),
      StructField("HST",     IntegerType, false),
      StructField("AST",     IntegerType, false),
      StructField("HF",     IntegerType, false),
      StructField("AF",     IntegerType, false),
      StructField("HC",     IntegerType, false),
      StructField("AC",     IntegerType, false),
      StructField("HY",     IntegerType, false),
      StructField("AY",     IntegerType, false),
      StructField("HR",     IntegerType, false),
      StructField("AR",     IntegerType, false),
      StructField("B365H",     DoubleType, false),
      StructField("B365D",     DoubleType, false),
      StructField("B365A",     DoubleType, false),
      StructField("BWH",     DoubleType, false),
      StructField("BWD",     DoubleType, false),
      StructField("BWA",     DoubleType, false),
      StructField("IWH",     DoubleType, false),
      StructField("IWD",     DoubleType, false),
      StructField("IWA",     DoubleType, false),
      StructField("LBH",     DoubleType, false),
      StructField("LBD",     DoubleType, false),
      StructField("LBA",     DoubleType, false),
      StructField("PSH",     DoubleType, false),
      StructField("PSD",     DoubleType, false),
      StructField("PSA",     DoubleType, false),
      StructField("WHH",     DoubleType, false),
      StructField("WHD",     DoubleType, false),
      StructField("WHA",     DoubleType, false),
      StructField("VCH",     DoubleType, false),
      StructField("VCD",     DoubleType, false),
      StructField("VCA",     DoubleType, false),
      StructField("Bb1X2",     DoubleType, false),
      StructField("BbMxH",     DoubleType, false),
      StructField("BbAvH",     DoubleType, false),
      StructField("BbMxD",     DoubleType, false),
      StructField("BbAvD",     DoubleType, false),
      StructField("BbMxA",     DoubleType, false),
      StructField("BbAvA",     DoubleType, false),
      StructField("BbOU",     DoubleType, false),
      StructField("BbMx>2.5",     DoubleType, false),
      StructField("BbAv>2.5",     DoubleType, false),
      StructField("BbMx<2.5",     DoubleType, false),
      StructField("BbAv<2.5",     DoubleType, false),
      StructField("BbAH",     DoubleType, false),
      StructField("BbAHh",     DoubleType, false),
      StructField("BbMxAHH",     DoubleType, false),
      StructField("BbAvAHH",     DoubleType, false),
      StructField("BbMxAHA",     DoubleType, false),
      StructField("BbAvAHA",     DoubleType, false),
      StructField("PSCH",     DoubleType, false),
      StructField("PSCD",     DoubleType, false),
      StructField("PSCA",     DoubleType, false)
    )
  )
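As an illustration of how a schema like this is typically handed to the CSV reader, here is a small sketch. It is not the code from this answer: `shortSchema` is a hypothetical, shortened version of the definition above, the sample file is made up so the snippet runs on its own, and the dates use the reader's default yyyy-MM-dd format (real files may need a `dateFormat` option).

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// Local session for illustration only.
val spark = SparkSession.builder()
  .appName("schema-sketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical, shortened schema: only the first seven columns.
val shortSchema = StructType(Array(
  StructField("Season", StringType, nullable = true),
  StructField("Div", StringType, nullable = true),
  StructField("Date", DateType, nullable = true),
  StructField("HomeTeam", StringType, nullable = true),
  StructField("AwayTeam", StringType, nullable = true),
  StructField("FTHG", IntegerType, nullable = true),
  StructField("FTAG", IntegerType, nullable = true)
))

// Made-up sample file so that the sketch is self-contained.
val dir = Files.createTempDirectory("football-sketch")
val csv = dir.resolve("E0.csv")
val content =
  "Season,Div,Date,HomeTeam,AwayTeam,FTHG,FTAG\n" +
  "1516,E0,2015-08-08,Team A,Team B,1,0\n" +
  "1516,E0,2015-08-09,Team C,Team D,0,2\n"
Files.write(csv, content.getBytes("UTF-8"))

// Read with an explicit schema instead of letting Spark infer the types.
val df = spark.read
  .option("header", "true") // skip the header line
  .schema(shortSchema)
  .csv(csv.toString)

df.printSchema()
df.show()
```

Passing the schema up front avoids a separate inference pass over the files and gives typed columns (dates, integers) directly.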

From here I can create my Match objects.

The schema definition looks like this:

{{1}}

(This isn't very useful for my use case, because I can't accurately compute the round from this dataset. But it was still good practice.)
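For a typed Dataset of Match objects, a comparable per-group numbering can be sketched with groupByKey and flatMapGroups. This is an assumption-laden illustration, not the approach used above: the `Match` class here is a simplified stand-in for the one in the question (dates as ISO strings, no goals), the sample rows are invented, and the `round` field simply receives the sequential match number within its league and season.

```scala
import org.apache.spark.sql.SparkSession

// Simplified, hypothetical stand-in for the question's Match class.
case class Match(
  startTime: String, // ISO date string, so lexical order equals date order
  homeTeam: String,
  awayTeam: String,
  league: String,
  season: String,
  round: Int = -1
)

// Local session for illustration only.
val spark = SparkSession.builder()
  .appName("group-numbering-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Made-up sample data.
val data = Seq(
  Match("2015-08-09", "Team C", "Team D", "E0", "1516"),
  Match("2015-08-08", "Team A", "Team B", "E0", "1516"),
  Match("2015-08-07", "Team E", "Team F", "E1", "1516")
).toDS()

val numbered = data
  .groupByKey(m => (m.league, m.season))
  .flatMapGroups { (_, ms) =>
    // Materialises one group in memory, sorts it by date and numbers it from 1.
    ms.toSeq.sortBy(_.startTime).zipWithIndex.map { case (m, i) => m.copy(round = i + 1) }
  }

numbered.show()
```

Each group is sorted in memory, which is fine for a few hundred matches per league and season; for large groups the window-function approach is likely the safer choice.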