Decile or other quantile rank for PySpark columns

Time: 2017-09-26 18:15:17

Tags: pyspark

I have a PySpark DataFrame with multiple numeric columns, and for each column I want to compute each row's decile or other quantile rank based on that variable.

This is simple in pandas: the qcut function creates a new column per variable, assigning values 0 to n-1 for the requested number of buckets, as in pd.qcut(x, q=n).
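For reference, a minimal pandas sketch of this behavior (the data and column names here are illustrative):

import pandas as pd

# Illustrative data; assign each row its tercile (0, 1, or 2) of var1.
df = pd.DataFrame({"var1": [1.0, 5.0, 2.0, 9.0, 7.0, 3.0]})
df["tercile"] = pd.qcut(df["var1"], q=3, labels=False)
print(df)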

How can this be done in PySpark? I have tried the following, but the breakpoints of these terciles are clearly not unique. I want the bottom 1/3 assigned 1, the next 1/3 assigned 2, and the top 1/3 assigned 3. I also want to be able to change the granularity, e.g. to 1/10, 1/32, etc.

from pyspark.sql import functions as F
from pyspark.sql.functions import ntile
from pyspark.sql.window import Window

# Partitioning by var1 itself places rows with the same var1 value in the
# same partition, so ntile(3) is computed within each value group rather
# than over the whole column; hence the overlapping ranges below.
w = Window.partitionBy(df.var1).orderBy(df.var1)
d2 = df.select(
    "var1",
    ntile(3).over(w).alias("ntile3")
)

agged = d2.groupby('ntile3').agg(
    F.min("var1").alias("min_var1"),
    F.max("var1").alias("max_var1"),
    F.count('*')
)
agged.show()

+------+--------+--------+--------+
|ntile3|min_var1|max_var1|count(1)|
+------+--------+--------+--------+
|     1|     0.0|   210.0|  517037|
|     3|     0.0|   206.0|  516917|
|     2|     0.0|   210.0|  516962|
+------+--------+--------+--------+

2 Answers:

Answer 0 (score: 6)

QuantileDiscretizer from 'pyspark.ml.feature' can be used for this.
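A minimal sketch of how QuantileDiscretizer could be applied to the terciles in the question (the DataFrame df and the column names are assumptions carried over from the question):

from pyspark.ml.feature import QuantileDiscretizer

# Bucket var1 into terciles 0.0, 1.0, 2.0; numBuckets=10 gives deciles,
# 32 gives 1/32 bins, and so on.
discretizer = QuantileDiscretizer(
    numBuckets=3,
    inputCol="var1",
    outputCol="var1_tercile",
    relativeError=0.0,  # exact quantiles; increase for a faster approximation
)
d2 = discretizer.fit(df).transform(df)
d2.show()

Note that the output buckets run from 0 to numBuckets - 1 (as doubles), so add 1 if you want ranks starting at 1 as in the question.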

Answer 1 (score: -1)

You can use percent_rank from pyspark.sql.functions together with a window function. For example, to compute deciles you could do:

from pyspark.sql.window import Window
from pyspark.sql.functions import ceil, percent_rank

# A single global window: rank all rows by var1, then map the percentile
# to a decile number by rounding up. Note that the row with the minimum
# value has percent_rank 0.0 and therefore lands in bucket 0.
w = Window.orderBy(data.var1)
data.select('*', ceil(10 * percent_rank().over(w)).alias("decile"))

This first computes the percent_rank, then multiplies it by 10 and rounds up. All values with a percent_rank in (0, 0.1] therefore land in decile 1, values in (0.1, 0.2] in decile 2, and so on (the single row with percent_rank exactly 0 is assigned 0).
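Since the question asks for a rank per column, a hypothetical extension of this approach over several numeric columns might look like the following (the numeric_cols list and column names are assumptions):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# One decile column per variable; each window orders by that variable alone.
numeric_cols = ["var1", "var2"]  # hypothetical column names
result = data
for c in numeric_cols:
    w = Window.orderBy(F.col(c))
    result = result.withColumn(c + "_decile", F.ceil(10 * F.percent_rank().over(w)))
result.show()

Be aware that a window with orderBy but no partitionBy pulls all rows into a single partition, which can be slow on large data; the QuantileDiscretizer approach in the other answer avoids that.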