Question

I'm getting some hands-on experience with Hadoop, Sqoop, Pig, Flume, etc.

In my local MySQL schema I have a table named Employee with the following structure:

`emp_id` int(11) NOT NULL AUTO_INCREMENT
`first_name` varchar(30) NOT NULL
`last_name` varchar(30) NOT NULL
`create_date` datetime NOT NULL

The Employee table has four rows.

I ran the following sqoop command:

sqoop --options-file import.txt \
--query "select 1 as emp_id, 'Barry' as first_name, 'Williams' as last_name, '2016-04-20 15:41:00' as create_date from test.Employee where \$CONDITIONS" \
--target-dir /user/<username>/Employee  \
--split-by emp_id \
-m 1

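The select ... in the sqoop command contains only one row of data, so only one row should be inserted.

Result of the sqoop command:

When I run the following command: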

hdfs dfs -cat /user/<username>/Employee/part-m-00000

I get:

1,Barry,Williams,2016-04-20 15:41:00
1,Barry,Williams,2016-04-20 15:41:00
1,Barry,Williams,2016-04-20 15:41:00
1,Barry,Williams,2016-04-20 15:41:00

Questions:

1) Why were four rows inserted instead of one?
2) Is it because there were four rows in the table when the `sqoop` command ran? 
3) Is this a bug?

Thanks in advance.

Answer 1

No, this is not a bug. You are querying it the wrong way: you need to add LIMIT to your SQL query. The updated query would look like this:

sqoop --options-file import.txt \
--query "select 1 as emp_id, 'Barry' as first_name, 'Williams' as last_name, '2016-04-20 15:41:00' as create_date from test.Employee where \$CONDITIONS LIMIT 1" \
--target-dir /user/<username>/Employee  \
--split-by emp_id \
-m 1

Answer 2

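I'm not sure whether this is a bug, but it is interesting; I have never tried to run a sqoop command this way.

--split-by: sqoop uses the specified column (the primary key) to split the units of work.

-m 1: forces sqoop to use only 1 mapper.

You have a free-form query import, and based on the query sqoop should create only 1 row. My assumption is that because both --split-by and -m 1 are passed to sqoop, --split-by may take precedence over -m. When -m is not specified, sqoop normally runs the job with 4 mappers, and my guess is that each mapper created 1 row from the hard-coded fields in the SQL statement.

Try the sqoop command without the --split-by argument.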

Answer 3

I don't know why you are getting 4 records. I get only 1 record on my system. Please add LIMIT 1 at the end of the select ... query, after WHERE $CONDITIONS, and see. Hope this helps.
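
A minimal sketch of the LIMIT 1 placement, for anyone who wants to check it without a cluster. It uses Python's built-in sqlite3 with an in-memory table as a stand-in for the MySQL Employee table, so the sample rows are made up. Replacing $CONDITIONS with (1 = 1) is my assumption about what sqoop substitutes when only one mapper is used.

```python
import sqlite3

# In-memory SQLite table standing in for test.Employee (assumption: the
# row-count behaviour of SELECT ... LIMIT is the same as in MySQL).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE Employee (
           emp_id INTEGER PRIMARY KEY AUTOINCREMENT,
           first_name TEXT NOT NULL,
           last_name TEXT NOT NULL,
           create_date TEXT NOT NULL)"""
)
# Four made-up rows, matching the row count described in the question.
conn.executemany(
    "INSERT INTO Employee (first_name, last_name, create_date) VALUES (?, ?, ?)",
    [
        ("Ann", "Lee", "2016-01-01 00:00:00"),
        ("Bob", "Ray", "2016-01-02 00:00:00"),
        ("Cid", "Poe", "2016-01-03 00:00:00"),
        ("Dee", "Fox", "2016-01-04 00:00:00"),
    ],
)

# LIMIT goes after the WHERE clause, not before it.
query = (
    "SELECT 1 AS emp_id, 'Barry' AS first_name, 'Williams' AS last_name, "
    "'2016-04-20 15:41:00' AS create_date "
    "FROM Employee WHERE $CONDITIONS LIMIT 1"
)

# Assumption: with a single mapper the placeholder becomes an always-true predicate.
limited_rows = conn.execute(query.replace("$CONDITIONS", "(1 = 1)")).fetchall()
print(limited_rows)
```

With LIMIT 1 after the WHERE clause the query returns a single row even though the table holds four.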

Answer 4

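Sqoop is working correctly. Try running this query directly against the database and you will see that the number of rows returned equals the number of rows in that table. The select list contains only constants, but the FROM clause still produces one result row for every row in test.Employee.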
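
Here is a rough way to see this without MySQL or sqoop. The sketch below uses Python's built-in sqlite3 with an in-memory table in place of test.Employee. The four sample rows are invented, and the always-true (1 = 1) predicate is only a stand-in for whatever sqoop puts in place of $CONDITIONS.

```python
import sqlite3

# In-memory SQLite table standing in for test.Employee (assumption: a
# constant-only select list behaves the same way in MySQL).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE Employee (
           emp_id INTEGER PRIMARY KEY AUTOINCREMENT,
           first_name TEXT NOT NULL,
           last_name TEXT NOT NULL,
           create_date TEXT NOT NULL)"""
)
# Four invented rows, matching the row count described in the question.
conn.executemany(
    "INSERT INTO Employee (first_name, last_name, create_date) VALUES (?, ?, ?)",
    [
        ("Ann", "Lee", "2016-01-01 00:00:00"),
        ("Bob", "Ray", "2016-01-02 00:00:00"),
        ("Cid", "Poe", "2016-01-03 00:00:00"),
        ("Dee", "Fox", "2016-01-04 00:00:00"),
    ],
)

# Same shape as the query in the question: only literals in the select list.
constant_rows = conn.execute(
    "SELECT 1 AS emp_id, 'Barry' AS first_name, 'Williams' AS last_name, "
    "'2016-04-20 15:41:00' AS create_date "
    "FROM Employee WHERE (1 = 1)"
).fetchall()

table_row_count = conn.execute("SELECT COUNT(*) FROM Employee").fetchone()[0]

# One identical constant row is produced per row in the table.
print(len(constant_rows), table_row_count)
for row in constant_rows:
    print(row)
```

The literal row comes back once per table row, which matches the four identical lines in part-m-00000.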

Why is "Sqoop import --query ..." inserting multiple rows when only one row is being inserted?
