Spark: aggregate all columns based on one column

Time: 2018-04-18 05:43:27

Tags: apache-spark dataframe bigdata aggregate

For simplicity, let's assume I have the following DataFrame:

col X  col Y  col Z
A      1      5
A      2      10
A      3      10
B      5      15

I want to group by column X and aggregate by taking the minimum of Z, but I want the Y value to come from the same row as that minimum Z:

df.groupBy("X").agg(min("Z"), take_y_according_to_min_z("Y"))

Expected output:

col X  col Y  col Z
A      1      5
B      5      15

Note: if more than one row in a group has the min("Z") value, I don't care which of those rows we take.
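To pin down the intended semantics independently of Spark, here is a minimal sketch using plain Scala collections. The `Row` case class and the `RowWithMinZ` object are hypothetical names introduced only for this illustration; the point is that the whole row with the smallest Z is kept per group, so Y travels with Z.

```scala
object RowWithMinZ {
  // Hypothetical row type mirroring the example table (not a Spark Row).
  final case class Row(x: String, y: Int, z: Int)

  val rows: Seq[Row] = Seq(
    Row("A", 1, 5),
    Row("A", 2, 10),
    Row("A", 3, 10),
    Row("B", 5, 15)
  )

  // For each X, keep the entire row whose Z is smallest.
  // If several rows share the minimum Z, any one of them is acceptable here.
  val result: Seq[Row] =
    rows.groupBy(_.x).values.map(_.minBy(_.z)).toSeq.sortBy(_.x)

  def main(args: Array[String]): Unit =
    result.foreach(println)
}
```

This is only a statement of the desired result on a local collection; it says nothing about how to express the same thing with the DataFrame API.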

I tried to find a clean, Spark-idiomatic way to do this online. I know exactly how to do it in MapReduce, but I couldn't find a way to do it in Spark.

I'm working with Spark 1.6.

2 answers:

Answer 0: (score: 2)

You can simply do


and you'll get what you want.


Answer 1: (score: 1)

You can wrap columns Z and Y in a struct, with Z first, and take the minimum of that struct:

df.groupBy("X").agg(min(struct("Z", "Y")).as("min"))
    .select("X", "min.*")

Output:

+---+---+---+
|X  |Z  |Y  |
+---+---+---+
|B  |15 |5  |
|A  |5  |1  |
+---+---+---+
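The reason this works, as far as I understand it, is that struct values are compared field by field in declaration order, so putting Z first makes Z decide the minimum and Y only break ties. Scala tuples order the same way, so the idea can be sketched without Spark. The object name and the sample pairs below are assumptions made up for illustration.

```scala
object StructOrderingSketch {
  // (Z, Y) pairs; two of them share Z = 10 to show the tie-breaking.
  val pairs: Seq[(Int, Int)] = Seq((10, 3), (5, 9), (10, 2))

  // The first element (Z) decides, even though its Y is the largest.
  val smallest: (Int, Int) = pairs.min

  // Among pairs with equal Z, the smaller second element (Y) wins.
  val tieBroken: (Int, Int) = pairs.filter(_._1 == 10).min

  def main(args: Array[String]): Unit = {
    println(smallest)  // (5,9)
    println(tieBroken) // (10,2)
  }
}
```

A side effect of this ordering is that ties on Z are resolved deterministically by the smaller Y, which is compatible with the question's "I don't care which row" requirement.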

Hope this helps!