I have a PySpark DataFrame with several numeric columns, and for each column I want to compute the decile (or some other quantile rank) of each row with respect to that variable.
This is simple in pandas: qcut creates a new column per variable, assigning each row a quantile index from 0 to n-1 for q bins, as in pd.qcut(x, q=n).
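For instance, a minimal sketch of what I mean in pandas (labels=False makes qcut return the integer bucket codes 0 to n-1):
import pandas as pd

x = pd.Series([1.0, 5.0, 2.0, 8.0, 9.0, 3.0])
# tertile index (0, 1, or 2) per row
print(pd.qcut(x, q=3, labels=False))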
How can this be done in PySpark? I have tried the following, but clearly the break points for these tertiles are not unique. I want the bottom 1/3 assigned 1, the next 1/3 assigned 2, and the top 1/3 assigned 3. I would also like to be able to change the granularity, e.g. to 1/10, 1/32, etc.
from pyspark.sql import functions as F
from pyspark.sql.functions import ntile
from pyspark.sql.window import Window

w = Window.partitionBy(df.var1).orderBy(df.var1)
d2 = df.select(
    "var1",
    ntile(3).over(w).alias("ntile3")
)
agged = d2.groupby('ntile3').agg(F.min("var1").alias("min_var1"), F.max("var1").alias("max_var1"), F.count('*'))
agged.show()
+------+--------+--------+--------+
|ntile3|min_var1|max_var1|count(1)|
+------+--------+--------+--------+
| 1| 0.0| 210.0| 517037|
| 3| 0.0| 206.0| 516917|
| 2| 0.0| 210.0| 516962|
+------+--------+--------+--------+
Answer 0 (score: 6)
You can use QuantileDiscretizer from 'pyspark.ml.feature'.
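A minimal sketch of how that might look (assuming a numeric column named "var1" on a DataFrame df; bucket labels run from 0 to numBuckets - 1):
from pyspark.ml.feature import QuantileDiscretizer

# Compute approximate tertile break points for "var1" and emit the
# bucket index in a new column.
discretizer = QuantileDiscretizer(numBuckets=3, inputCol="var1", outputCol="var1_bucket")
bucketed = discretizer.fit(df).transform(df)
bucketed.show()
The break points are approximate; the relativeError parameter trades precision for speed in the underlying quantile computation.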
Answer 1 (score: -1)
You can use percent_rank from pyspark.sql.functions together with a window function. For example, to compute deciles you could do the following:
from pyspark.sql.window import Window
from pyspark.sql.functions import ceil, percent_rank

w = Window.orderBy(data.var1)
data.select('*', ceil(10 * percent_rank().over(w)).alias("decile"))
This first computes the percent_rank, multiplies it by 10, and takes the ceiling, so all values whose percent_rank falls in (0, 0.1] go to decile 1, those in (0.1, 0.2] to decile 2, and so on. One caveat: the row(s) holding the minimum value have percent_rank exactly 0, so the ceiling puts them in bucket 0 rather than 1.
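If you want to avoid that, here is a small sketch (same assumed DataFrame data and column "var1" as above) that clamps the label into 1..n:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

n = 10  # number of quantile buckets; swap in 3, 32, etc.
w = Window.orderBy("var1")
# greatest(1, ceil(n * percent_rank)) keeps every bucket label in 1..n.
deciled = data.select(
    "*",
    F.greatest(F.lit(1), F.ceil(n * F.percent_rank().over(w))).alias("decile")
)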