How do I sum in pyspark?

Time: 2018-10-01 10:36:56

Tags: pyspark

I have the table below, and I just want to sum the _10 column, but I ran into an error.

1 Answer:

Answer 0: (score: 0)

I'm not sure what you mean by sum. If you want to sum the values of an entire column, you can use the agg function. Or, if you want a row-wise sum like _10 + _12 as a new column, use the withColumn function:

>>> data = sc.parallelize([
...     ('yearID','H','3B'),
...     ('2004','0','0'),
...     ('2006','0','0'),
...     ('2007','0','0'),
...     ('2008','0','0'),
...     ('2009','0','0'),
...     ('2010','0','0'),
...     ('1954','131','6'),
...     ('1955','189','9'),
...     ('1956','200','14'),
...     ('1957','198','6')
...     ])
>>> 
>>> cols = ['_2','_10','_12']
>>> 
>>> df = spark.createDataFrame(data,cols)
>>> 
>>> df.show()
+------+---+---+
|    _2|_10|_12|
+------+---+---+
|yearID|  H| 3B|
|  2004|  0|  0|
|  2006|  0|  0|
|  2007|  0|  0|
|  2008|  0|  0|
|  2009|  0|  0|
|  2010|  0|  0|
|  1954|131|  6|
|  1955|189|  9|
|  1956|200| 14|
|  1957|198|  6|
+------+---+---+

>>> df.agg({'_10':'sum','_12':'sum'}).show()
+--------+--------+
|sum(_12)|sum(_10)|
+--------+--------+
|    35.0|   718.0|
+--------+--------+

>>> df.withColumn('new_col', df['_10']+df['_12']).show()
+------+---+---+-------+
|    _2|_10|_12|new_col|
+------+---+---+-------+
|yearID|  H| 3B|   null|
|  2004|  0|  0|    0.0|
|  2006|  0|  0|    0.0|
|  2007|  0|  0|    0.0|
|  2008|  0|  0|    0.0|
|  2009|  0|  0|    0.0|
|  2010|  0|  0|    0.0|
|  1954|131|  6|  137.0|
|  1955|189|  9|  198.0|
|  1956|200| 14|  214.0|
|  1957|198|  6|  204.0|
+------+---+---+-------+