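Group a dataset into separate sub-datasets based on a value

Time: 2017-03-21 15:34:38

Tags: java apache-spark dataset grouping

I want to implement a program on a dataset that consists of the following columns: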

+-----------+---------------+-------------------+-----------------------+
|Item_ID    |Product_Name   |Manufacturer_Name  |Product_Description    |
+-----------+---------------+-------------------+-----------------------+
|12345      |Pen            |Cello              |Ball Pen Soft Nib...   |
|12346      |Pencil         |Nataraja           |Pencil HB Extra D...   |
|42345      |Ruler          |Nataraja           |Scale No.1103 15c...   |
|12677      |Sharpener      |Nataraja           |Pencil Shraperner...   |
|12987      |Pen            |Reynolds           |Dot Pen Extra Gr...    |
|44326      |Pen            |Reynolds           |Gel Pen German T...    |
|13456      |Pen            |Cello              |Dot Pen 0.5mm Nib...   |
|19876      |Eraser         |Cello              |Dust free Eraser ...   |
|43246      |Ink Pen        |Hero               |Ink Pen Smooth Ha...   |
+-----------+---------------+-------------------+-----------------------+

I want to group the dataset by Manufacturer_Name, as shown below:

Manufacturer = Cello
+-----------+---------------+-------------------+-----------------------+
|Item_ID    |Product_Name   |Manufacturer_Name  |Product_Description    |
+-----------+---------------+-------------------+-----------------------+
|12345      |Pen            |Cello              |Ball Pen Soft Nib...   |
|13456      |Pen            |Cello              |Dot Pen 0.5mm Nib...   |
|19876      |Eraser         |Cello              |Dust free Eraser ...   |
+-----------+---------------+-------------------+-----------------------+

Manufacturer = Nataraja
+-----------+---------------+-------------------+-----------------------+
|Item_ID    |Product_Name   |Manufacturer_Name  |Product_Description    |
+-----------+---------------+-------------------+-----------------------+
|12346      |Pencil         |Nataraja           |Pencil HB Extra D...   |
|42345      |Ruler          |Nataraja           |Scale No.1103 15c...   |
|12677      |Sharpener      |Nataraja           |Pencil Shraperner...   |
+-----------+---------------+-------------------+-----------------------+

Manufacturer = Reynolds
+-----------+---------------+-------------------+-----------------------+
|Item_ID    |Product_Name   |Manufacturer_Name  |Product_Description    |
+-----------+---------------+-------------------+-----------------------+
|12987      |Pen            |Reynolds           |Dot Pen Extra Gr...    |
|44326      |Pen            |Reynolds           |Gel Pen German T...    |
+-----------+---------------+-------------------+-----------------------+

Manufacturer = Hero
+-----------+---------------+-------------------+-----------------------+
|Item_ID    |Product_Name   |Manufacturer_Name  |Product_Description    |
+-----------+---------------+-------------------+-----------------------+
|43246      |Ink Pen        |Hero               |Ink Pen Smooth Ha...   |
+-----------+---------------+-------------------+-----------------------+

I tried the following code, but it does not work well. Please help me improve this program. Here is the code I used:

Dataset<Row> countsBy = src.select("Manufacturer_Name").distinct();
List<Row> lsts = countsBy.collectAsList();
for (Row lst : lsts) {
    String man = lst.toString();
    System.out.println("Records of " + man + " only");
    Dataset<Row> mandataset = src.filter("Manufacturer_Name='" + man + "'");
    mandataset.show();
}

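1 answer:

Answer 0 (score: 0)

Maybe you could try building a map of datasets whose key is a String (the Manufacturer_Name). On each iteration, you read the row's Manufacturer_Name and check whether it is already in the map (creating the entry if needed). Finally, you add the row to the matching dataset.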

You would end up with something like a map keyed by Manufacturer_Name, holding one group of rows per manufacturer.

You then need a second loop, but only to print the data.
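The map idea described above can be sketched in plain Java. This is only an illustration under stated assumptions: each row is a `String[]` standing in for a Spark `Row`, and the class name `ManufacturerGrouping`, the method `groupByManufacturer`, and the constant `MANUFACTURER_COLUMN` are hypothetical names, not part of any library. With Spark, the rows would first have to be collected to the driver, which is only reasonable for small data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ManufacturerGrouping {

    // Assumed column layout: Item_ID, Product_Name, Manufacturer_Name, Product_Description.
    static final int MANUFACTURER_COLUMN = 2;

    // First loop: add each row to the list stored under its manufacturer,
    // creating that list the first time the manufacturer is seen.
    static Map<String, List<String[]>> groupByManufacturer(List<String[]> rows) {
        // LinkedHashMap keeps manufacturers in first-seen order.
        Map<String, List<String[]>> groups = new LinkedHashMap<>();
        for (String[] row : rows) {
            groups.computeIfAbsent(row[MANUFACTURER_COLUMN], key -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    public static void main(String[] args) {
        // A few rows taken from the question's sample data.
        List<String[]> rows = Arrays.asList(
            new String[] {"12345", "Pen", "Cello", "Ball Pen Soft Nib..."},
            new String[] {"12346", "Pencil", "Nataraja", "Pencil HB Extra D..."},
            new String[] {"12987", "Pen", "Reynolds", "Dot Pen Extra Gr..."},
            new String[] {"13456", "Pen", "Cello", "Dot Pen 0.5mm Nib..."},
            new String[] {"43246", "Ink Pen", "Hero", "Ink Pen Smooth Ha...."}
        );

        Map<String, List<String[]>> groups = groupByManufacturer(rows);

        // Second loop: only prints the groups that were built above.
        for (Map.Entry<String, List<String[]>> entry : groups.entrySet()) {
            System.out.println("Manufacturer = " + entry.getKey());
            for (String[] row : entry.getValue()) {
                System.out.println(String.join(" | ", row));
            }
        }
    }
}
```

As a side note on the code in the question: `lst.toString()` most likely returns the row rendered with brackets (for example `[Cello]`) rather than the bare name, so the filter never matches. Reading the value with `lst.getString(0)` is probably what that loop needs.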

I hope this solves your problem!

Edit: replaced "Dictionary" with "Map" (sorry) and added a link:

How do you create a dictionary in Java?

Edit: changed the code to match the new idea.