Suppose you have a DataFrame with three numeric columns (plus two string key columns), like:
>>> df.show()
+-----------+---------------+------------------+--------------+---------------------+
| IP| URL|num_rebuf_sessions|first_rebuffer|num_backward_sessions|
+-----------+---------------+------------------+--------------+---------------------+
|10.45.12.13| ww.tre.com/ada| 1261| 764| 2043|
|10.54.12.34|www.rwr.com/yuy| 1126| 295| 1376|
|10.44.23.09|www.453.kig/827| 2725| 678| 1036|
|10.23.43.14|www.res.poe/skh| 2438| 224| 1455|
|10.32.22.10|www.res.poe/skh| 3655| 157| 1838|
|10.45.12.13|www.453.kig/827| 7578| 63| 1754|
|10.45.12.13| ww.tre.com/ada| 3854| 448| 1224|
|10.34.22.10|www.rwr.com/yuy| 1029| 758| 1275|
|10.54.12.34| ww.tre.com/ada| 7341| 10| 856|
|10.34.22.10| ww.tre.com/ada| 4150| 455| 1372|
+-----------+---------------+------------------+--------------+---------------------+
The schema is:
>>> df.printSchema()
root
|-- IP: string (nullable = true)
|-- URL: string (nullable = true)
|-- num_rebuf_sessions: long (nullable = false)
|-- first_rebuffer: long (nullable = false)
|-- num_backward_sessions: long (nullable = false)
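(For reference, a minimal sketch of how such a DataFrame could be built, assuming a SQLContext named sqlContext; the rows are just the first few shown above.)

from pyspark.sql import functions as func  # used as `func` in the statement further down

df = sqlContext.createDataFrame(
    [("10.45.12.13", "ww.tre.com/ada", 1261, 764, 2043),
     ("10.54.12.34", "www.rwr.com/yuy", 1126, 295, 1376),
     ("10.44.23.09", "www.453.kig/827", 2725, 678, 1036)],
    ["IP", "URL", "num_rebuf_sessions", "first_rebuffer", "num_backward_sessions"])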
Question
I am interested in computing a complex aggregation over it, say (sum(num_rebuf_sessions) - sum(num_backward_sessions)) * 100 / sum(first_rebuffer).
How do I do this programmatically? The aggregation could be anything supplied as input (say, via an XML or JSON file).
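For illustration, the input could look something like the following (a hypothetical JSON layout and file name, only to show what "supplied as input" means here):

import json

# Hypothetical agg_config.json:
# {
#   "group_by": ["IP", "URL"],
#   "expression": "(sum(num_rebuf_sessions) - sum(first_rebuffer)) * 100 / sum(num_backward_sessions)",
#   "alias": "Result"
# }
with open("agg_config.json") as f:
    conf = json.load(f)

keyList = conf["group_by"]            # columns to group on
agg_expression = conf["expression"]   # arbitrary aggregate expression, as a string

The open question is how to turn agg_expression into something that agg will accept.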
Note: in the interpreter I can run the full statement directly (with keyList = ['IP', 'URL'] and func being pyspark.sql.functions), e.g.
>>> df.groupBy(keyList).agg((((func.sum('num_rebuf_sessions') - func.sum('first_rebuffer')) * 100)/func.sum('num_backward_sessions')).alias('Result')).show()
+-----------+---------------+------------------+
| IP| URL| Result|
+-----------+---------------+------------------+
|10.54.12.34|www.rwr.com/yuy|263.70753561548884|
|10.23.43.14|www.453.kig/827| 278.7099317601011|
|10.34.22.10| ww.tre.com/ada|187.53939800299088|
+-----------+---------------+------------------+
However, agg accepts only a dict or a list of Column, which makes it hard to achieve the above. Is pyspark.sql.context.SQLContext.sql the only option left? Or am I missing something obvious?
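To make the limitation concrete: with the dict form I can only map each column to a single built-in aggregate, so the compound expression cannot be expressed that way (a minimal sketch, reusing keyList from above):

# dict form: one built-in aggregate per column, no arithmetic between aggregates
df.groupBy(keyList).agg({'num_rebuf_sessions': 'sum',
                         'first_rebuffer': 'sum',
                         'num_backward_sessions': 'sum'}).show()

# The list-of-Column form works (as in the interpreter statement above), but there the
# Column expression is hard-coded in Python rather than built from the input file.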