Group sorting in Flink's CoGroupOperator does not seem to work as expected

Asked: 2019-01-21 11:07:50

Tags: java apache-flink

I am implementing a Flink batch job that joins two datasets with the coGroup transformation. The goal is to join orders to users by userId (a 1-n relationship). Before entering the CoGroupFunction, the orders within each group should be sorted by price in ascending order. I am trying to apply that ordering with the sortSecondGroup operation, but it has no effect and the data stays unsorted.

Example

Users (userId, userName)

{0, john}, {1, jane}, {2, richard}

Orders (userId, price, productName)

{0, 200, phone}, {0, 5, soap}, {0, 50, book},
{1, 30, shirt},  {1, 15, potato},
{2, 500, laptop},{2, 10, pen}, {2, 300, headphones}

Result (userName, [productName1, productName2]), with each user's products sorted by price

{john, [soap, book, phone]},
{jane, [potato, shirt]},
{richard, [pen, headphones, laptop]}
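In plain Java terms, the ordering expected from sortSecondGroup corresponds to sorting each user's orders by the price field. A minimal sketch of that expected semantics (the `Order` record and helper name are illustrative, not Flink API):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ExpectedOrdering {
    // Each order: {userId, price, productName}, mirroring the tuples above.
    record Order(int userId, int price, String product) {}

    // Group orders by userId, sort each group by price ascending, keep only product names.
    static Map<Integer, List<String>> productsByUserSortedByPrice(List<Order> orders) {
        return orders.stream().collect(Collectors.groupingBy(
                Order::userId,
                Collectors.collectingAndThen(
                        Collectors.toList(),
                        group -> group.stream()
                                .sorted(Comparator.comparingInt(Order::price))
                                .map(Order::product)
                                .collect(Collectors.toList()))));
    }

    public static void main(String[] args) {
        List<Order> orders = List.of(
                new Order(0, 200, "phone"), new Order(0, 5, "soap"), new Order(0, 50, "book"),
                new Order(1, 30, "shirt"), new Order(1, 15, "potato"));
        System.out.println(productsByUserSortedByPrice(orders));
        // prints {0=[soap, book, phone], 1=[potato, shirt]}
    }
}
```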

Job structure

users
   .coGroup(orders)
   .where(tuple1 -> tuple1.f0)
   .equalTo(tuple2 -> tuple2.f0)
    //sort orders by price
   .sortSecondGroup(1, Order.ASCENDING)
   .with(CoGroupJob::joinUserNameWithProductName);

Expected: I expected sortSecondGroup to sort the orders within each group as specified before they are passed to CoGroupJob::joinUserNameWithProductName.

Actual: sortSecondGroup has no effect; the data arrives unsorted.

Flink version: 1.6.0

PS: I could sort the entire orders dataset before the coGroup, but the dataset is huge and the job becomes very slow (roughly 10x). On the other hand, sorting the data inside CoGroupJob::joinUserNameWithProductName with Collections.sort() leads to an OutOfMemoryError, because too much data is held on the heap.
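For reference, the in-function sort mentioned in the PS looks roughly like the sketch below (illustrative names, not the exact code). It buffers the whole group in an ArrayList before sorting, so heap usage grows with group size, which is what makes this approach blow up on very large groups:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class InGroupSort {
    // One order: price plus product name, standing in for Tuple3<Integer, Integer, String>.
    record Order(int price, String product) {}

    // Buffer the entire group, sort it by price, then emit the product names.
    // Heap usage is proportional to the group size, hence the OutOfMemoryError on huge groups.
    static List<String> namesSortedByPrice(Iterable<Order> group) {
        List<Order> buffer = new ArrayList<>();
        group.forEach(buffer::add);
        buffer.sort(Comparator.comparingInt(Order::price)); // equivalent to Collections.sort(...)
        List<String> names = new ArrayList<>(buffer.size());
        for (Order o : buffer) {
            names.add(o.product());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(namesSortedByPrice(List.of(
                new Order(200, "phone"), new Order(5, "soap"), new Order(50, "book"))));
        // prints [soap, book, phone]
    }
}
```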

Java code that reproduces the problem:

import org.apache.flink.api.common.operators.Order;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.List;

public class CoGroupJob {

    public static DataSet<Tuple2<String, List<String>>> processOrders(
            DataSet<Tuple2<Integer, String>> users,
            DataSet<Tuple3<Integer, Integer, String>> orders
    ) {
        return users
                .coGroup(orders)
                .where(tuple1 -> tuple1.f0)
                .equalTo(tuple2 -> tuple2.f0)
                //sort orders by price
                .sortSecondGroup(1, Order.ASCENDING)
                .with(CoGroupJob::joinUserNameWithProductName);
    }

    private static void joinUserNameWithProductName(Iterable<Tuple2<Integer, String>> users, Iterable<Tuple3<Integer, Integer, String>> orders, Collector<Tuple2<String, List<String>>> out) {

        // Exactly one user per group is assumed (userId is unique in the users dataset).
        String userName = users.iterator().next().f1;

        // Collect product names in the order the runtime hands them over; this is where
        // sortSecondGroup is expected to have established price-ascending order.
        List<String> orderNames = new ArrayList<>();
        orders.iterator().forEachRemaining(order -> orderNames.add(order.f2));

        Tuple2<String, List<String>> namesWithOrders = Tuple2.of(userName, orderNames);
        out.collect(namesWithOrders);
    }
}


And the test:


import com.google.common.collect.ImmutableList;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;

import org.junit.jupiter.api.Test;

import java.util.List;
import static org.assertj.core.api.Assertions.assertThat;


class CoGroupJobTest {

    @Test
    void shouldJoinUsersWithOrdersSortedByPrice() throws Exception {
        // Plain JUnit 5 cannot inject ExecutionEnvironment as a test parameter; create one locally.
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        List<Tuple2<String, List<String>>> userOrders = CoGroupJob.processOrders(
                users(env, ImmutableList.of(
                        Tuple2.of(0, "john"), Tuple2.of(1, "jane"), Tuple2.of(2, "richard")
                )),

                orders(env, ImmutableList.of(
                        Tuple3.of(0, 200, "phone"),  Tuple3.of(0, 5,  "soap"),  Tuple3.of(0, 50,  "book"),
                        Tuple3.of(1, 30,  "shirt"),  Tuple3.of(1, 15, "potato"),
                        Tuple3.of(2, 500, "laptop"), Tuple3.of(2, 10, "pen"),   Tuple3.of(2, 300, "headphones")
                ))
        ).collect();

        assertThat(userOrders)
                .contains(
                        Tuple2.of("john",    ImmutableList.of("soap", "book", "phone")),
                        Tuple2.of("jane",    ImmutableList.of("potato", "shirt")),
                        Tuple2.of("richard", ImmutableList.of("pen", "headphones", "laptop"))
                );
    }

    private static DataSource<Tuple2<Integer, String>> users(ExecutionEnvironment env, List<Tuple2<Integer, String>> users) {
        return env.fromCollection(users, TypeInformation.of(new TypeHint<Tuple2<Integer, String>>(){}));
    }

    private static DataSource<Tuple3<Integer, Integer, String>> orders(ExecutionEnvironment env, List<Tuple3<Integer, Integer, String>> orders) {
        return env.fromCollection(orders, TypeInformation.of(new TypeHint<Tuple3<Integer, Integer, String>>(){}));
    }
}

0 answers