Creating a new DataFrame column based on an existing DataFrame in Spark

Date: 2018-05-30 09:26:25

Tags: scala apache-spark

I have two DataFrames, and I need to populate a new column Flag in DF1 based on the following conditions.

DF1
    +------+-------------------+
    |AMOUNT|Brand              |
    +------+-------------------+
    | 47.88|          Parle    |
    | 40.92|          Parle    |
    | 83.82|          Parle    |
    |106.58|          Parle    |
    | 90.51|          Flipkart |
    | 11.48|          Flipkart |
    | 18.47|          Flipkart |
    | 40.92|          Flipkart |
    |  30.0|          Flipkart |
    +------+-------------------+


DF2

+--------------------+-------+----------+
|       Brand        |   P1  |   P2     |
+--------------------+-------+----------+
|               Parle| 37.00 |  100.15  |
|            Flipkart|  10.0 |  30.0    |
+--------------------+-------+----------+

If, for brand Parle, the Amount in DF1 is lower than the P1 value for Parle in DF2 (Amount < P1), the flag should be "low"; if P1 <= Amount <= P2, the flag should be "mid"; and if Amount > P2, the flag should be "high". The same applies to the other brands. Expected output:

+------+-------------------+----------------+
|AMOUNT|Brand              |Flag            |
+------+-------------------+----------------+
| 47.88|          Parle    | mid            |
| 40.92|          Parle    | mid            |
| 83.82|          Parle    | mid            |
|106.58|          Parle    | high           |
| 90.51|          Flipkart | high           |
| 11.48|          Flipkart | mid            |
| 18.47|          Flipkart | mid            |
| 40.92|          Flipkart | high           |
|  30.0|          Flipkart | mid            |
+------+-------------------+----------------+

The data in DF1 is very large, while DF2 is very small.
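
For reference, here is a minimal sketch of how DF1 and DF2 could be constructed from the sample data above (the application name and local master are illustrative; column names are taken from the tables shown):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("brand-flag-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// DF1: the large table of amounts per brand (sample rows from above)
val df1 = Seq(
  (47.88, "Parle"), (40.92, "Parle"), (83.82, "Parle"), (106.58, "Parle"),
  (90.51, "Flipkart"), (11.48, "Flipkart"), (18.47, "Flipkart"),
  (40.92, "Flipkart"), (30.0, "Flipkart")
).toDF("AMOUNT", "Brand")

// DF2: the small lookup table of per-brand thresholds
val df2 = Seq(
  ("Parle", 37.00, 100.15),
  ("Flipkart", 10.0, 30.0)
).toDF("Brand", "P1", "P2")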

I know I can do a join and get the result, but how should I structure this logic in Spark?

2 Answers:

Answer 0 (score: 1):

A simple left join and the nested when built-in function should get you the result you want:

import org.apache.spark.sql.functions._
df1.join(df2, Seq("Brand"), "left")
  .withColumn("Flag", when(col("AMOUNT") < col("P1"), "low").otherwise(
    when(col("AMOUNT") >= col("P1") && col("AMOUNT") <= col("P2"), "mid").otherwise(
      when(col("AMOUNT") > col("P2"), "high").otherwise("unknown"))))
  .select("AMOUNT", "Brand", "Flag")
  .show(false)

which should give you

+------+--------+----+
|AMOUNT|Brand   |Flag|
+------+--------+----+
|47.88 |Parle   |mid |
|40.92 |Parle   |mid |
|83.82 |Parle   |mid |
|106.58|Parle   |high|
|90.51 |Flipkart|high|
|11.48 |Flipkart|mid |
|18.47 |Flipkart|mid |
|40.92 |Flipkart|high|
|30.0  |Flipkart|mid |
+------+--------+----+

I hope the answer is helpful.
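
One additional note, not part of the original answer: since DF1 is very large and DF2 is tiny, explicitly broadcasting DF2 can keep the join from shuffling DF1. Spark often applies this optimization automatically for small tables, but the hint makes it explicit. A sketch, reusing df1 and df2 from the setup above:

import org.apache.spark.sql.functions.{broadcast, col, when}

// Broadcast the small threshold table so each executor gets a local copy
df1.join(broadcast(df2), Seq("Brand"), "left")
  .withColumn("Flag",
    when(col("AMOUNT") < col("P1"), "low")
      .when(col("AMOUNT") >= col("P1") && col("AMOUNT") <= col("P2"), "mid")
      .when(col("AMOUNT") > col("P2"), "high")
      .otherwise("unknown"))
  .select("AMOUNT", "Brand", "Flag")
  .show(false)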

Answer 1 (score: 0):

I think this is also doable with a udf; a sketch of that approach is shown below.
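
The code originally posted with this answer did not survive, so the following is only an illustrative sketch of a udf-based approach (not the original author's code): collect the small DF2 into a driver-side map and classify each row of DF1 with a udf.

import org.apache.spark.sql.functions.{col, udf}

// Collect the small DF2 into a Brand -> (P1, P2) lookup map on the driver
val limits: Map[String, (Double, Double)] = df2.collect()
  .map(r => r.getAs[String]("Brand") -> (r.getAs[Double]("P1"), r.getAs[Double]("P2")))
  .toMap

// Classify an amount against its brand's thresholds
val flagUdf = udf { (brand: String, amount: Double) =>
  limits.get(brand) match {
    case Some((p1, _)) if amount < p1  => "low"
    case Some((_, p2)) if amount <= p2 => "mid"
    case Some(_)                       => "high"
    case None                          => "unknown"
  }
}

df1.withColumn("Flag", flagUdf(col("Brand"), col("AMOUNT"))).show(false)

Note that the built-in when/otherwise approach from the first answer is usually preferable, since Catalyst can optimize it, whereas a udf is a black box to the optimizer.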
