What I have done so far:
Code:
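// java.sql.Date is assumed here: Spark has no built-in encoder for java.util.Date.
import java.sql.Date

case class Match(
  startTime: Date,
  homeTeam: String,
  awayTeam: String,
  homeGoals: Int,
  awayGoals: Int,
  league: String,
  season: String,
  round: Int = -1 // -1 means "not numbered yet"
)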
import org.apache.spark.sql.SparkSession

object Parser {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("test")
      .getOrCreate()
    import spark.implicits._
    val data = spark.read
      .textFile("data/football-data.co.uk/1516/E*.csv")
      .filter(s => !s.startsWith("Season,Div,Date")) // remove headers.
      .map(s => s.split(","))
      .map(a => createMatch(a)) // createMatch (not shown) builds a Match from the split fields
  }
}
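Now I want to number each match within each league and season.
I'm drawing a blank here. I have tried partitioning and grouping without any luck, and I badly need some pointers.
Am I wrong to attempt this in Spark, given that I need to keep state while iterating over each league + season?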
Answer 0: (score: 1)
Generating row numbers over a partition/window is best done with the ROW_NUMBER function provided by spark-sql.
data.createOrReplaceTempView("temp_table")
val newDF = spark.sql(
  """
    SELECT
      <list other columns required>,
      ROW_NUMBER() OVER (PARTITION BY league, season ORDER BY startTime) AS slno
    FROM temp_table
  """)
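The same numbering can also be written with the DataFrame API instead of a SQL string. The following is a minimal sketch, not code from either poster: the object name `NumberMatches`, the helper `withMatchNumber`, the sample rows and the local master are all assumptions made for illustration. `homeTeam` is added to the ordering only so that matches kicking off at the same time get a stable number.

```scala
import java.sql.Date
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

object NumberMatches {
  // Adds a 1-based sequence number per (league, season), ordered by kick-off time.
  def withMatchNumber(matches: DataFrame): DataFrame = {
    val perLeagueSeason = Window
      .partitionBy("league", "season")
      .orderBy("startTime", "homeTeam") // homeTeam is an assumed tie-breaker
    matches.withColumn("slno", row_number().over(perLeagueSeason))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("number-matches-sketch")
      .master("local[*]") // assumption: local run, for demonstration only
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows, not taken from the real data set.
    val sample = Seq(
      (Date.valueOf("2015-08-09"), "Team C", "Team D", "E0", "1516"),
      (Date.valueOf("2015-08-08"), "Team A", "Team B", "E0", "1516"),
      (Date.valueOf("2015-08-08"), "Team X", "Team Y", "E1", "1516")
    ).toDF("startTime", "homeTeam", "awayTeam", "league", "season")

    withMatchNumber(sample).orderBy("league", "season", "slno").show()
    spark.stop()
  }
}
```

Spark handles the per-group ordering itself, so no hand-written state is needed while iterating. Keep in mind that `slno` is only a sequence number by kick-off time within each league and season.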
Answer 1: (score: 0)
With help from rouge-one's answer, I ended up doing this:
val schema =
StructType(
Array(
StructField("Season", StringType, false),
StructField("Div", StringType, false),
StructField("Date", DateType, false),
StructField("HomeTeam", StringType, false),
StructField("AwayTeam", StringType, false),
StructField("FTHG", IntegerType, false),
StructField("FTAG", IntegerType, false),
StructField("FTR", StringType, false),
StructField("HTHG", IntegerType, false),
StructField("HTAG", IntegerType, false),
StructField("HTR", StringType, false),
StructField("Referee", StringType, false),
StructField("HS", IntegerType, false),
StructField("AS", IntegerType, false),
StructField("HST", IntegerType, false),
StructField("AST", IntegerType, false),
StructField("HF", IntegerType, false),
StructField("AF", IntegerType, false),
StructField("HC", IntegerType, false),
StructField("AC", IntegerType, false),
StructField("HY", IntegerType, false),
StructField("AY", IntegerType, false),
StructField("HR", IntegerType, false),
StructField("AR", IntegerType, false),
StructField("B365H", DoubleType, false),
StructField("B365D", DoubleType, false),
StructField("B365A", DoubleType, false),
StructField("BWH", DoubleType, false),
StructField("BWD", DoubleType, false),
StructField("BWA", DoubleType, false),
StructField("IWH", DoubleType, false),
StructField("IWD", DoubleType, false),
StructField("IWA", DoubleType, false),
StructField("LBH", DoubleType, false),
StructField("LBD", DoubleType, false),
StructField("LBA", DoubleType, false),
StructField("PSH", DoubleType, false),
StructField("PSD", DoubleType, false),
StructField("PSA", DoubleType, false),
StructField("WHH", DoubleType, false),
StructField("WHD", DoubleType, false),
StructField("WHA", DoubleType, false),
StructField("VCH", DoubleType, false),
StructField("VCD", DoubleType, false),
StructField("VCA", DoubleType, false),
StructField("Bb1X2", DoubleType, false),
StructField("BbMxH", DoubleType, false),
StructField("BbAvH", DoubleType, false),
StructField("BbMxD", DoubleType, false),
StructField("BbAvD", DoubleType, false),
StructField("BbMxA", DoubleType, false),
StructField("BbAvA", DoubleType, false),
StructField("BbOU", DoubleType, false),
StructField("BbMx>2.5", DoubleType, false),
StructField("BbAv>2.5", DoubleType, false),
StructField("BbMx<2.5", DoubleType, false),
StructField("BbAv<2.5", DoubleType, false),
StructField("BbAH", DoubleType, false),
StructField("BbAHh", DoubleType, false),
StructField("BbMxAHH", DoubleType, false),
StructField("BbAvAHH", DoubleType, false),
StructField("BbMxAHA", DoubleType, false),
StructField("BbAvAHA", DoubleType, false),
StructField("PSCH", DoubleType, false),
StructField("PSCD", DoubleType, false),
StructField("PSCA", DoubleType, false)
)
)
From here I can create my Match objects.
The schema definition is the StructType shown above.
(This is not very useful for my use case, because I cannot accurately work out the round from this data set. It was still good practice, though.)
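A schema like the one in this answer is normally handed to Spark's CSV reader through `.schema(...)`. The sketch below is not the answerer's code. It uses a shortened, hypothetical schema with only a few of the columns, a temporary file with made-up rows, and assumes that the file has a header line and that dates are written as `dd/MM/yy`. The names `ReadWithSchema`, `shortSchema` and `read` are illustrative only.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DateType, IntegerType, StringType, StructField, StructType}

object ReadWithSchema {
  // Shortened, hypothetical schema: only a few of the columns from the full definition.
  val shortSchema: StructType = StructType(Array(
    StructField("Div", StringType, nullable = false),
    StructField("Date", DateType, nullable = false),
    StructField("HomeTeam", StringType, nullable = false),
    StructField("AwayTeam", StringType, nullable = false),
    StructField("FTHG", IntegerType, nullable = false),
    StructField("FTAG", IntegerType, nullable = false)
  ))

  // Reads a CSV file with an explicit schema instead of splitting lines by hand.
  def read(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")         // assumption: first line holds column names
      .option("dateFormat", "dd/MM/yy") // assumption about how dates are written
      .schema(shortSchema)
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-with-schema-sketch")
      .master("local[*]") // assumption: local run, for demonstration only
      .getOrCreate()

    // Made-up rows written to a temporary file so the sketch runs on its own.
    val file = Files.createTempFile("matches", ".csv")
    val csv =
      "Div,Date,HomeTeam,AwayTeam,FTHG,FTAG\n" +
      "E0,08/08/15,Team A,Team B,2,1\n" +
      "E0,09/08/15,Team C,Team D,0,0\n"
    Files.write(file, csv.getBytes(StandardCharsets.UTF_8))

    val matches = read(spark, file.toString)
    matches.printSchema()
    matches.show()
    spark.stop()
  }
}
```

Reading with a schema gives typed columns directly, so a window function can order by the `Date` column without any manual parsing.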