Question

I am using a partitioned Hive ORC table, and I am trying to read it in Spark with this query:

spark.sql("select count(*) from test.puid_tuid where date = '20170316'").show

But I get this error:

Caused by: java.io.FileNotFoundException: File hdfs://localhost:8020/hive/warehouse/test.db/puid_tuid/date=20170316 does not exist.
    at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:948)
    at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:927)
    at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:872)
    at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:868)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:886)
    at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1696)
    at org.apache.hadoop.hive.shims.Hadoop23Shims.listLocatedStatus(Hadoop23Shims.java:667)
    at org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:361)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:634)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:620)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

So I checked this path in HDFS, and it does not exist there.

Then I ran the same query in Hive, and it returned zero records.

I also listed all the Hive partitions, and the list does contain this partition.
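The manual check described above (partition registered in the metastore, directory missing in HDFS) can be sketched as a small helper. This is only an illustration under stated assumptions: `PartitionCheck`, `missingPartitions`, and the `date=20170315` partition are made-up names, the existence check is passed in as a plain function so the logic does not depend on a cluster, and every partition is assumed to live at its default location under the table root.

```scala
// Hypothetical helper, not part of Spark or Hive.
object PartitionCheck {

  // partitionSpecs: strings in the form printed by SHOW PARTITIONS, e.g. "date=20170316".
  // pathExists: any predicate that reports whether a directory exists.
  // Returns the specs whose directory under tableRoot is missing.
  def missingPartitions(
      tableRoot: String,
      partitionSpecs: Seq[String],
      pathExists: String => Boolean): Seq[String] =
    partitionSpecs.filterNot { spec =>
      pathExists(s"${tableRoot.stripSuffix("/")}/$spec")
    }

  def main(args: Array[String]): Unit = {
    val root = "hdfs://localhost:8020/hive/warehouse/test.db/puid_tuid"
    // Stand-in for HDFS: only the (made-up) 20170315 directory exists.
    val existing = Set(s"$root/date=20170315")
    val missing = missingPartitions(root, Seq("date=20170315", "date=20170316"), existing)
    println(missing)
  }
}
```

In spark-shell, the specs could presumably be taken from `spark.sql("show partitions test.puid_tuid")` and the predicate built on Hadoop's `FileSystem.exists`; partitions with a custom location would need their real path from the metastore instead.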

Is there any way in Spark to ignore all partitions that have no files in HDFS?

Unable to read Hive partitions using Spark

0 answers: