Why?

To split the data, Sqoop fires

SELECT MIN(col1), MAX(col1) FROM TABLE

and then divides that range by the number of mappers.

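Now take an integer as the --split-by column.

The table has an id column with values from 1 to 100, and you use 4 mappers (-m 4) in the sqoop command.

Sqoop gets the MIN and MAX values using: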

SELECT MIN(id), MAX(id) FROM TABLE

Output:

1, 100

Splitting integers is easy. You get 4 parts:

1-25
26-50
51-75
76-100
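
The arithmetic behind that split can be sketched in a few lines of Python. This is only an illustration of dividing an inclusive integer range evenly; the function name `integer_splits` is made up for this example and is not part of Sqoop.

```python
def integer_splits(lo, hi, num_mappers):
    """Divide the inclusive range [lo, hi] into contiguous, near-equal parts."""
    total = hi - lo + 1
    base, extra = divmod(total, num_mappers)
    splits = []
    start = lo
    for i in range(num_mappers):
        # The first `extra` parts take one additional value each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more mappers than values: stop early
        end = start + size - 1
        splits.append((start, end))
        start = end + 1
    return splits


print(integer_splits(1, 100, 4))  # [(1, 25), (26, 50), (51, 75), (76, 100)]
```

Each part covers 25 consecutive ids, so if the ids have no gaps every mapper reads the same number of rows.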

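Now take a string as the --split-by column.

The table has a name column with values from "dev" to "sam", and you use 4 mappers (-m 4) in the sqoop command.

Sqoop gets the MIN and MAX values using:

SELECT MIN(name), MAX(name) FROM TABLE

Output:

dev, sam

So how is this range divided into 4 parts? According to the Sqoop docs: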

/**
   * This method needs to determine the splits between two user-provided
   * strings.  In the case where the user's strings are 'A' and 'Z', this is
   * not hard; we could create two splits from ['A', 'M') and ['M', 'Z'], 26
   * splits for strings beginning with each letter, etc.
   *
   * If a user has provided us with the strings "Ham" and "Haze", however, we
   * need to create splits that differ in the third letter.
   *
   * The algorithm used is as follows:
   * Since there are 2**16 unicode characters, we interpret characters as
   * digits in base 65536. Given a string 's' containing characters s_0, s_1
   * .. s_n, we interpret the string as the number: 0.s_0 s_1 s_2.. s_n in
   * base 65536. Having mapped the low and high strings into floating-point
   * values, we then use the BigDecimalSplitter to establish the even split
   * points, then map the resulting floating point values back into strings.
   */
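
To make the comment above more concrete, here is a rough Python sketch of the same idea: treat each character as a base-65536 digit of a fraction, split the numeric interval evenly, and map the split points back to strings. It is not Sqoop's actual implementation. The function names and the `max_chars` limit are assumptions for illustration, and it assumes every character fits in 16 bits.

```python
from fractions import Fraction

BASE = 65536  # one "digit" per 16-bit character, as in the comment above


def string_to_fraction(s, max_chars=8):
    """Interpret s as the number 0.s_0 s_1 ... s_n in base 65536."""
    value = Fraction(0)
    for i, ch in enumerate(s[:max_chars]):
        value += Fraction(ord(ch), BASE ** (i + 1))
    return value


def fraction_to_string(value, max_chars=8):
    """Map a fraction in [0, 1) back to a string, one digit at a time."""
    chars = []
    for _ in range(max_chars):
        if value == 0:
            break
        value *= BASE
        digit = int(value)  # the integer part is the next character code
        chars.append(chr(digit))
        value -= digit
    return "".join(chars)


def text_split_points(lo, hi, num_splits, max_chars=8):
    """Return num_splits + 1 boundary strings spaced evenly between lo and hi."""
    lo_f = string_to_fraction(lo, max_chars)
    hi_f = string_to_fraction(hi, max_chars)
    step = (hi_f - lo_f) / num_splits
    return [fraction_to_string(lo_f + step * i, max_chars)
            for i in range(num_splits + 1)]


points = text_split_points("dev", "sam", 4)
# ascii() escapes the arbitrary code points that appear in the middle points.
print([ascii(p) for p in points])
print([p[0] for p in points])  # ['d', 'g', 'k', 'o', 's']
```

The first and last points come back as "dev" and "sam", but the points in between contain arbitrary Unicode characters that are unlikely to match real names. That is one reason text splits tend to line up poorly with the actual data.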

You will also see these warnings in the code:

LOG.warn("Generating splits for a textual index column.");
LOG.warn("If your database sorts in a case-insensitive order, "
    + "this may result in a partial import or duplicate records.");
LOG.warn("You are strongly encouraged to choose an integral split column.");

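In the integer example, all mappers get a balanced load (each one fetches 25 records from the RDBMS).

With strings, the values are much less likely to be spread evenly across the range, so it is hard to give all mappers a similar load.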
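
A small Python sketch shows the difference in load. The helper `rows_per_split`, the list of names, and the string boundaries (roughly one per first letter) are all hypothetical sample data chosen for illustration, not output from Sqoop.

```python
from bisect import bisect_right


def rows_per_split(values, boundaries):
    """Count values per range [b_i, b_{i+1}); the last range includes its upper bound."""
    counts = [0] * (len(boundaries) - 1)
    for v in values:
        idx = min(bisect_right(boundaries, v) - 1, len(counts) - 1)
        counts[idx] += 1
    return counts


# Integer ids 1..100 with evenly spaced boundaries: a balanced load.
ids = list(range(1, 101))
print(rows_per_split(ids, [1, 26, 51, 76, 100]))  # [25, 25, 25, 25]

# Hypothetical names clustered near the start of the "dev".."sam" range.
names = ["dev", "dia", "don", "dov", "eli", "eve", "fay", "gus", "sam"]
print(rows_per_split(names, ["dev", "g", "k", "o", "sam"]))  # [7, 1, 0, 1]
```

With this sample, one mapper would handle seven of the nine rows while another would get none.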

In short, choose an integer column as the --split-by column.

Answer 2

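Yes, we can do this, but it is not recommended because of performance issues. As we know, Sqoop runs the boundary query "select min(pk/split-by column), max(pk/split-by column) from table where condition" to compute the split size for the mappers. split-size = (max - min) / number of mappers

Suppose there is a table named employee.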

id      name    age
1       baba    20
2       kishor  30
3       jay     40
..........
100001  pk      60

Scenario 1:

Split on the id column

In this case, Sqoop fires the boundary query select min(id), max(id) from employee to compute the split size.

min = 1
max = 100001

default number of mappers = 4

split-size = (100001 - 1) / 4 = 25000

So each mapper processes about 25000 records:
mapper 1:  1 - 25000
mapper 2:  25001 - 50000
mapper 3:  50001 - 75000
mapper 4:  75001 - 100001
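
As a rough illustration, the per-mapper ranges can be written as WHERE conditions derived from the boundary values. The Python sketch below uses a hypothetical helper, `split_conditions`, and assumes max - min is at least the number of mappers. The conditions Sqoop actually generates may differ in detail.

```python
def split_conditions(column, lo, hi, num_mappers):
    """Build one WHERE condition per mapper from the boundary query result."""
    size = (hi - lo) // num_mappers  # split-size = (max - min) / number of mappers
    bounds = [lo + i * size for i in range(num_mappers)] + [hi]
    conditions = []
    for i in range(num_mappers):
        # Only the last range includes its upper bound, so max is not lost.
        op = "<=" if i == num_mappers - 1 else "<"
        conditions.append(
            f"{column} >= {bounds[i]} AND {column} {op} {bounds[i + 1]}"
        )
    return conditions


for cond in split_conditions("id", 1, 100001, 4):
    print(cond)
# id >= 1 AND id < 25001
# id >= 25001 AND id < 50001
# id >= 50001 AND id < 75001
# id >= 75001 AND id <= 100001
```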

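So if we have an integer column, Sqoop can split the records easily.

Scenario 2:

Split on the name column

In this case, Sqoop fires "select min(name), max(name) from employee" to compute the split size.

min = baba, max = pk

Sqoop cannot compute the split size directly, because min and max are text values and (max - min) / number of mappers has no arithmetic meaning for strings. It therefore runs the TextSplitter class to perform the split, which adds overhead and may affect performance.

Note: we need to pass the extra argument -Dorg.apache.sqoop.splitter.allow_text_splitter=true to use the TextSplitter class.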
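
For reference, here is a small Python sketch that assembles such a command line. The JDBC URL and the helper name `build_sqoop_import` are placeholders made up for this example. The one thing to note is the position of the -D property: generic arguments go right after the tool name, before the import-specific options.

```python
import shlex


def build_sqoop_import(connect, table, split_by, mappers, allow_text_splitter=False):
    """Assemble a sqoop import command as an argument list."""
    cmd = ["sqoop", "import"]
    if allow_text_splitter:
        # Generic -D properties precede the tool-specific options.
        cmd.append("-Dorg.apache.sqoop.splitter.allow_text_splitter=true")
    cmd += ["--connect", connect,
            "--table", table,
            "--split-by", split_by,
            "-m", str(mappers)]
    return cmd


# Placeholder connection string; substitute your own database URL.
print(shlex.join(build_sqoop_import("jdbc:mysql://dbhost/corp", "employee", "name", 4, True)))
```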

Answer 3

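No, it must be numeric, because according to the documentation: "By default sqoop will use query select min(<split-by>), max(<split-by>) from <table name> to find out boundaries for creating splits." The alternative is to use --boundary-query, which also requires a numeric column. Otherwise, the Sqoop job will fail. If the table has no such column, the only workaround is to use a single mapper: "-m 1".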

Sqoop import: data type of the split-by column

3 Answers:
