I am looking for the PySpark equivalent of an operation on a pandas dataframe. In particular, I want to perform the following operation on a PySpark dataframe:
# in pandas dataframe, I can do the following operation
# assuming df = pandas dataframe
index = df['column_A'] > 0.0
amount = (sum(df.loc[index, 'column_B'] * df.loc[index, 'column_C'])
          / sum(df.loc[index, 'column_C']))
What would the PySpark equivalent of this operation on a PySpark dataframe be?
Answer 0 (score: 2)
This is quite simple to do with RDDs (I am not that familiar with spark.sql.DataFrame):
x, y = (df.rdd
        .filter(lambda x: x.column_A > 0.0)
        .map(lambda x: (x.column_B * x.column_C, x.column_C))
        .reduce(lambda x, y: (x[0] + y[0], x[1] + y[1])))
amount = x / y
Or filter the DataFrame first and then drop down to the RDD:
x, y = (df
        .filter(df.column_A > 0.0)
        .rdd
        .map(lambda x: (x.column_B * x.column_C, x.column_C))
        .reduce(lambda x, y: (x[0] + y[0], x[1] + y[1])))
amount = x / y
After some digging, I'm not sure this is the most efficient way, but it doesn't touch the RDD at all:
x, y = (df
        .filter(df.column_A > 0.0)
        .select((df.column_B * df.column_C).alias("product"), df.column_C)
        # note: the x, y unpacking below assumes sum(product) is returned as the first column
        .agg({'product': 'sum', 'column_C': 'sum'})).first()
amount = x / y
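A possible way to sidestep that column-ordering assumption (not from the original answer; the aliases product_sum and weight_sum are purely illustrative) is to name each aggregate explicitly with pyspark.sql.functions and read the sums back by name:

from pyspark.sql import functions as F

# Sketch: alias each aggregate so the result can be read by column name
# instead of relying on the order in which the sums come back.
row = (df
       .filter(F.col("column_A") > 0.0)
       .agg(F.sum(F.col("column_B") * F.col("column_C")).alias("product_sum"),
            F.sum(F.col("column_C")).alias("weight_sum"))
       .first())
amount = row["product_sum"] / row["weight_sum"]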
Answer 1 (score: 2)
A Spark DataFrame has no strict ordering, so indexing is not meaningful. Instead, we use a SQL-like DSL. Here you would use where (filter) and select. If the data looked like this:
import pandas as pd
import numpy as np
from pyspark.sql.functions import col, sum as sum_
np.random.seed(1)
df = pd.DataFrame({
    c: np.random.randn(1000) for c in ["column_A", "column_B", "column_C"]
})
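Evaluating the weighted-average expression from the question on this data:

index = df["column_A"] > 0.0
amount = (sum(df.loc[index, "column_B"] * df.loc[index, "column_C"])
          / sum(df.loc[index, "column_C"]))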
amount will then be:

amount
# 0.9334143225687774
and the Spark equivalent is:
sdf = spark.createDataFrame(df)

(amount_, ) = (sdf
               .where(sdf.column_A > 0.0)
               .select(sum_(sdf.column_B * sdf.column_C) / sum_(sdf.column_C))
               .first())
The results are numerically equivalent:
abs(amount - amount_)
# 1.1102230246251565e-16
You can also use conditional expressions:
from pyspark.sql.functions import when
pred = col("column_A") > 0.0
amount_expr = sum_(
    when(pred, col("column_B")) * when(pred, col("column_C"))
) / sum_(when(pred, col("column_C")))
sdf.select(amount_expr).first()[0]
# 0.9334143225687773
This looks more Pandas-like, but it is more verbose.
Answer 2 (score: 0)
A quicker, more PySpark-style answer:
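A minimal sketch of such an approach, assuming the same sdf and column names as in the previous answer and staying entirely within the DataFrame API (the variable name amount_ is illustrative):

from pyspark.sql import functions as F

# Sketch: filter on column_A, then compute the weighted average in a single aggregation.
amount_ = (sdf
           .filter(F.col("column_A") > 0.0)
           .agg((F.sum(F.col("column_B") * F.col("column_C"))
                 / F.sum(F.col("column_C"))).alias("amount"))
           .first()["amount"])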
$shareName = "myshare"
New-AzRmStorageShare `
-ResourceGroupName $resourceGroupName `
-StorageAccountName $storageAccountName `
-Name $shareName `
-QuotaGiB 1024 | Out-Null