Answer 0 (Score: 5)
First, enter your data as a literal DataFrame:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("test").getOrCreate()
df = spark.createDataFrame([
(1,'female',233),
(None,'female',314),
(0,'female',81),
(1, None, 342),
(1, 'male', 109),
(None, None, 891),
(0, None, 549),
(None, 'male', 577),
(0, None, 468)
],
['survived', 'sex', 'count'])
Then we use a window function over a partition containing the complete set of rows to compute the sum of count (which is, in effect, the grand total):
import pyspark.sql.functions as f
from pyspark.sql.window import Window
df = df.withColumn('percent', f.col('count')/f.sum('count').over(Window.partitionBy()))
df.orderBy('percent', ascending=False).show()
+--------+------+-----+--------------------+
|survived| sex|count| percent|
+--------+------+-----+--------------------+
| null| null| 891| 0.25|
| null| male| 577| 0.16189674523007858|
| 0| null| 549| 0.15404040404040403|
| 0| null| 468| 0.13131313131313133|
| 1| null| 342| 0.09595959595959595|
| null|female| 314| 0.08810325476992144|
| 1|female| 233| 0.0653759820426487|
| 1| male| 109| 0.03058361391694725|
| 0|female| 81|0.022727272727272728|
+--------+------+-----+--------------------+
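As a sanity check: the nine counts sum to 3564, so the top row is 891 / 3564 = 0.25.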
If you split the above into two steps, it is easy to see that the window-function sum simply attaches the same total value to every row:
df = df\
.withColumn('total', f.sum('count').over(Window.partitionBy()))\
.withColumn('percent', f.col('count')/f.col('total'))
df.show()
+--------+------+-----+-----+--------------------+
|survived|   sex|count|total|             percent|
+--------+------+-----+-----+--------------------+
|       1|female|  233| 3564|  0.0653759820426487|
|    null|female|  314| 3564| 0.08810325476992144|
|       0|female|   81| 3564|0.022727272727272728|
|       1|  null|  342| 3564| 0.09595959595959595|
|       1|  male|  109| 3564| 0.03058361391694725|
|    null|  null|  891| 3564|                0.25|
|       0|  null|  549| 3564| 0.15404040404040403|
|    null|  male|  577| 3564| 0.16189674523007858|
|       0|  null|  468| 3564| 0.13131313131313133|
+--------+------+-----+-----+--------------------+
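If you prefer percentages to fractions, the same window expression can be scaled and rounded (a cosmetic variation on the code above, reusing the same imports):
import pyspark.sql.functions as f
from pyspark.sql.window import Window

# Same total-over-everything window, scaled to a 0-100 range and rounded
df = df.withColumn(
    'percent',
    f.round(100 * f.col('count') / f.sum('count').over(Window.partitionBy()), 2)
)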
Answer 1 (Score: 1)
This is probably the preferred option, as it uses Spark the way it is most likely meant to be used (i.e., it neither involves explicitly collecting data to the driver nor triggers any warnings):
df = spark.createDataFrame([
(1,'female',233),
(None,'female',314),
(0,'female',81),
(1, None, 342),
(1, 'male', 109),
(None, None, 891),
(0, None, 549),
(None, 'male', 577),
(0, None, 468)
],
['survived', 'sex', 'count'])
df.createOrReplaceTempView("df")  # registerTempTable is deprecated since Spark 2.0
sql = """
select *, count/(select sum(count) from df) as percentage
from df
"""
spark.sql(sql).show()
Note that for the kind of larger datasets typically processed with Spark, you would not want to use the solution whose window spans the entire dataset (e.g. w = Window.partitionBy()). In fact, Spark warns you about exactly that:
WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
To illustrate the difference, here is the non-windowed version:
sql = """
select *, count/(select sum(count) from df) as percentage
from df
"""
Note that at no point are all 9 rows shuffled onto a single executor.
And here is the version with the window:
sql = """
select *, count/sum(count) over () as perc
from df
"""
Note the larger amount of data in the exchange (shuffle) step, and the point at which the single-partition exchange takes place.
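The original answer illustrated this with screenshots of the two query plans, which are not reproduced here. You can inspect the plans yourself with explain():
# Prints the physical plan; in the windowed variant, look for the
# "Exchange SinglePartition" node that moves every row to one partition.
spark.sql(sql).explain()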
Answer 2 (Score: 0)
Something like the following should work:
df = sc.parallelize([
    (1, 'female', 233),
    (None, 'female', 314),
    (0, 'female', 81),
    (1, None, 342),
    (1, 'male', 109)
]).toDF(['survived', 'sex', 'count'])
# Bring the grand total back to the driver as a Python scalar
total = df.agg({"count": "sum"}).collect().pop()['sum(count)']
# Express each count as a percentage of that total
result = df.withColumn('percent', (df['count'] / total) * 100)
result.show()
+--------+------+-----+------------------+
|survived| sex|count| percent|
+--------+------+-----+------------------+
| 1|female| 233| 21.59406858202039|
| null|female| 314|29.101019462465246|
| 0|female| 81| 7.506950880444857|
| 1| null| 342| 31.69601482854495|
| 1| male| 109|10.101946246524559|
+--------+------+-----+------------------+
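A slightly more direct way to pull the scalar total back to the driver (same result as the collect().pop() above):
from pyspark.sql.functions import sum as sum_

# agg returns a one-row DataFrame; first()[0] extracts the scalar
total = df.agg(sum_('count')).first()[0]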
Answer 3 (Score: 0)
You need to:
- compute the total sum,
- divide each count by that total to find the percentage,
- and add the result as a new column (a minimal sketch of these steps follows).
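A minimal sketch of those three steps in PySpark, reusing the df from Answer 0 (the crossJoin keeps everything inside Spark; the name total_df is purely illustrative):
import pyspark.sql.functions as f

# Step 1: the total as a one-row DataFrame
total_df = df.agg(f.sum('count').alias('total'))

# Steps 2 and 3: attach the total to every row, divide, and keep the result
result = df.crossJoin(total_df).withColumn('percent', f.col('count') / f.col('total'))
result.show()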
Answer 4 (Score: 0)
Suppose your df contains columns a, b, c and d, and you need each value as a percentage of its column's total. Here is what you can do; it is faster than the window function :)
import pyspark.sql.functions as fn

# One aggregation computes all four column totals at once
divideDF = df.agg(fn.sum('a').alias('a1'),
                  fn.sum('b').alias('b1'),
                  fn.sum('c').alias('c1'),
                  fn.sum('d').alias('d1'))
divideDF = divideDF.take(1)  # a single Row holding the totals
a1 = divideDF[0]['a1']
b1 = divideDF[0]['b1']
c1 = divideDF[0]['c1']
d1 = divideDF[0]['d1']
# Scale each column by its total to get a percentage
df = df.withColumn('a_percentage', fn.lit(100) * (fn.col('a') / fn.lit(a1)))
df = df.withColumn('b_percentage', fn.lit(100) * (fn.col('b') / fn.lit(b1)))
df = df.withColumn('c_percentage', fn.lit(100) * (fn.col('c') / fn.lit(c1)))
df = df.withColumn('d_percentage', fn.lit(100) * (fn.col('d') / fn.lit(d1)))
df.show()
Enjoy!
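The same idea can be written as a loop over the columns, so the pattern scales to any number of them (a sketch; it assumes the same a, b, c, d column names as above):
import pyspark.sql.functions as fn

cols = ['a', 'b', 'c', 'd']
# One aggregation pass computes every column total; first() returns the single Row
totals = df.agg(*[fn.sum(c).alias(c) for c in cols]).first()
for c in cols:
    df = df.withColumn(c + '_percentage', 100 * fn.col(c) / fn.lit(totals[c]))
df.show()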
Answer 5 (Score: 0)
If anyone wants to compute a percentage by dividing two columns, the code is below; it simply derives from the logic above, and you can plug in any columns you like. Here I divide the Salary column by itself, so the result is always 100%:
from pyspark.sql.functions import *

# As a standalone projection (Salary divided by itself, hence always 100):
dfm = df.select(((col('Salary')) / (col('Salary'))) * 100)
# Or as a new column on the original DataFrame:
df = df.withColumn('dfm', (col('Salary') / col('Salary')) * 100)
df.show()
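For two genuinely different columns the pattern is the same; here is a sketch with a hypothetical Bonus column standing in for whatever numerator you actually have:
from pyspark.sql.functions import col

# Hypothetical example: Bonus as a percentage of Salary
df = df.withColumn('bonus_pct', (col('Bonus') / col('Salary')) * 100)
df.show()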