Question

Scenario:

I have imported data from SQL Server into HDFS. The data is stored across multiple files in an HDFS directory:

part-m-00000
part-m-00001
part-m-00002
part-m-00003

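Question:

When reading this stored data back from the HDFS directory, do we have to read all of the files (part-m-00000, 01, 02, 03), or only part-m-00000? When I read the data, I found that some of it appeared to be missing in HDFS. Is that expected, or am I missing something?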

Answer 1

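You need to read all of the files, not just 00000. There are multiple files because Sqoop works in a map-reduce fashion, splitting the import job into several parts. The output of each part is written to a separate file.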
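To illustrate why reading only the first file looks like data loss, here is a minimal sketch that uses a local temporary directory as a stand-in for the HDFS import directory. The rows and the directory are made up for illustration; on a real cluster the equivalent read would be something like `hdfs dfs -cat <import-dir>/part-m-*`, with the path depending on your import.

```shell
# Local stand-in for an HDFS import directory (hypothetical data, not real output).
dir=$(mktemp -d)
printf '1,alice\n2,bob\n' > "$dir/part-m-00000"
printf '3,carol\n'        > "$dir/part-m-00001"
printf '4,dave\n5,erin\n' > "$dir/part-m-00002"
printf '6,frank\n'        > "$dir/part-m-00003"

# Reading only the first part file returns a subset of the rows.
first_only=$(wc -l < "$dir/part-m-00000" | tr -d ' ')

# Reading every part file through a glob returns the full import.
all_parts=$(cat "$dir"/part-m-* | wc -l | tr -d ' ')

echo "first file only: $first_only rows"
echo "all part files:  $all_parts rows"
```

Each mapper writes its own slice of the table, so the row counts only add up when every part file is included.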


Answer 2

Sqoop runs the import with no reducers. As a result, the part files produced by the mappers are not consolidated, so you will see as many part files as the number of mappers set in the sqoop command with -m 4 or --num-mappers 4. If you run sqoop import --connect jdbc:mysql://localhost/db --username <> --table <> -m 1, it will create only one part file.

Answer 3

If your result is large, Hive will store the result in chunks. To read all of the files from the CLI, run the following command.

$ sudo cat part-m-*

It will give you the final result without missing any part.

Query related to sqoop-import?
