Spark and Parquet: Reading Partitioned Data

Asked: 2016-07-11 23:22:19

Tags: apache-spark-sql

SparkSQL has an excellent trick: it will read your parquet data, correctly inferring the schema from parquet's metadata. What's more: if you have data partitioned using a key=value scheme, SparkSQL will automatically recurse through the directory structure, reading those values in as a column called key. Documentation on this -- along with a pretty clear example -- is here.
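
For example, here is a minimal sketch of the kind of layout and read the documentation describes -- the paths and the SQLContext setup are my own illustration, not taken from the docs:

```scala
import org.apache.spark.sql.SQLContext

// Hypothetical layout matching the key=value convention:
//   /data/events/year=2015/month=04/day=29/part-00000.gz.parquet
//   /data/events/year=2015/month=04/day=30/part-00000.gz.parquet

val sqlContext = new SQLContext(sc)  // sc: the SparkContext from spark-shell

// Partition discovery is automatic: year, month, and day show up as
// columns alongside the fields read from the parquet metadata.
val df = sqlContext.read.parquet("/data/events")
df.printSchema()
```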

Unfortunately, my data is partitioned in a way that works nicely for Cascading but doesn't seem to jibe with SparkSQL's expectations:

2015/
└── 04
    ├── 29
    │   └── part-00000-00002-r-00000.gz.parquet
    └── 30
        └── part-00000-00001-r-00000.gz.parquet

In Cascading, I can specify a PartitionTap, tell it the first three items in the path will be year, month, and day, and I'm off to the races. But I cannot figure out how to achieve a similar effect in SparkSQL. Is it possible to do any of the following? (I've sketched what I have in mind for each just after the list.)

  1. Just ignore the partitioning; recurse down to the parquet data and read everything found. (I am aware that I could roll my own code to this effect using Hadoop's FileSystem API, but I'd really rather not.)
  2. Specify a partial schema -- e.g. "columns are year (int), month (int), day (int), plus infer the rest from parquet"
  3. Specify the whole schema?
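
To make those options concrete, here is roughly what I am imagining for each -- the paths are made up, and I don't know whether either call actually behaves the way I want:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._

val sqlContext = new SQLContext(sc)  // sc: the SparkContext from spark-shell

// Option 1: ignore partitioning and glob straight down to the leaf
// directories, reading every parquet file found under 2015/.
val everything = sqlContext.read.parquet("/data/2015/*/*")

// Options 2 and 3: hand the reader a schema up front. For option 2 I would
// describe only year/month/day and hope SparkSQL infers the rest from the
// parquet metadata; for option 3 I would have to spell out every field.
val partitionColumns = StructType(Seq(
  StructField("year", IntegerType),
  StructField("month", IntegerType),
  StructField("day", IntegerType)
))
val withSchema = sqlContext.read.schema(partitionColumns).parquet("/data/2015")
```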

(My parquet data contains nested structures, which SparkSQL can read and interact with beautifully as long as I let it do so automagically. If I try to specify the schema manually, it doesn't seem to handle the nested structures.)
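
For reference, this is the shape of schema I would end up writing by hand if I went the manual route -- the nested field names below are invented, but my real data has this kind of structure:

```scala
import org.apache.spark.sql.types._

// Invented field names, but representative of the nesting in my data.
val manualSchema = StructType(Seq(
  StructField("year", IntegerType),
  StructField("month", IntegerType),
  StructField("day", IntegerType),
  StructField("event", StructType(Seq(
    StructField("id", StringType),
    StructField("tags", ArrayType(StringType)),
    StructField("payload", StructType(Seq(
      StructField("key", StringType),
      StructField("value", StringType)
    )))
  )))
))
```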

0 Answers:

No answers.