So, basically, I need to run the following two queries using Spark SQL, but I can't figure out how to do it without the compiler throwing some random errors.
Query 1:
SELECT DISTINCT c.name, count(p.pid) AS cnt FROM clubs c
JOIN teams t ON c.cid = t.cid
JOIN tournaments d ON d.tid = t.tid
JOIN players p ON p.ncid = c.ncid
WHERE c.cid = 45 AND d.tyear = 2014
GROUP BY c.name
ORDER BY cnt DESC
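As a side note, the shape of query 1 can be sanity-checked outside Spark against an in-memory SQLite database. The schema and toy rows below are assumptions, invented only to exercise the joins and the aggregate:

```python
import sqlite3

# Toy in-memory database mirroring the tables query 1 references.
# Column names are taken from the query; the data is made up.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE clubs (cid INT, ncid INT, name TEXT);
CREATE TABLE teams (cid INT, tid INT);
CREATE TABLE tournaments (tid INT, tyear INT);
CREATE TABLE players (pid INT, ncid INT);
INSERT INTO clubs VALUES (45, 1, 'Club A');
INSERT INTO teams VALUES (45, 10);
INSERT INTO tournaments VALUES (10, 2014);
INSERT INTO players VALUES (100, 1), (101, 1);
""")

# Query 1, with the count aliased so ORDER BY can refer to it.
rows = cur.execute("""
SELECT DISTINCT c.name, count(p.pid) AS cnt
FROM clubs c
JOIN teams t ON c.cid = t.cid
JOIN tournaments d ON d.tid = t.tid
JOIN players p ON p.ncid = c.ncid
WHERE c.cid = 45 AND d.tyear = 2014
GROUP BY c.name
ORDER BY cnt DESC
""").fetchall()
print(rows)  # [('Club A', 2)]
```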
Query 2:
SELECT DISTINCT t.tyear, c.name,
       (SELECT max(m.matchdate) - min(m.matchdate)
        FROM matches m
        WHERE t.tyear = date_part('year', m.matchdate)) AS days
FROM tournaments t
JOIN hosts h ON t.tyear = h.tyear
JOIN countries c ON c.cid = h.cid
JOIN stadiums s ON s.cid = c.cid
JOIN matches m ON m.sid = s.sid
GROUP BY t.tyear, c.name, s.sid
ORDER BY days DESC
I've already tried running query 1 like this, without success:
from pyspark.sql import SparkSession

spark1 = SparkSession.builder.appName('spark').getOrCreate()
teams = spark1.read.csv("teams.csv", header=True, mode="DROPMALFORMED").cache()
clubs = spark1.read.csv("clubs.csv", header=True, mode="DROPMALFORMED").cache()
tournaments = spark1.read.csv("tournaments.csv", header=True, mode="DROPMALFORMED")
players = spark1.read.csv("players.csv", header=True, mode="DROPMALFORMED")
spark1.sql('SELECT DISTINCT c.name, count(p.pid) FROM clubs JOIN teams t on c.cid = t.cid JOIN tournaments d on t.tid = t.tid JOIN players p on p.ncid = c.ncid WHERE c.cid = 45 AND d.tyear = 2014 GROUP BY c.name ORDER BY count DESC').show()
Any help would be greatly appreciated!