SparkSQL has an excellent trick: it will read your parquet data, correctly picking up the schema from parquet's metadata. What's more, if your data is partitioned using a key=value directory scheme, SparkSQL will automatically recurse through the directory structure, reading those values in as a column called key. Documentation on this -- along with a pretty clear example -- is here.
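For example (a minimal sketch, assuming the Spark 1.4-style read API and a made-up /data/events path), a key=value layout just works:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("partition-discovery"))
val sqlContext = new SQLContext(sc)

// Given /data/events/year=2015/month=04/day=29/part-*.parquet, partition
// discovery adds year, month, and day as columns automatically.
val df = sqlContext.read.parquet("/data/events")
df.printSchema()                                   // schema includes year, month, day
df.filter(df("year") === 2015 && df("day") === 29) // prunes to the matching directories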
Unfortunately, my data is partitioned in a way that works nicely for Cascading but doesn't seem to jibe with SparkSQL's expectations:
2015/
└── 04
├── 29
│ └── part-00000-00002-r-00000.gz.parquet
└── 30
└── part-00000-00001-r-00000.gz.parquet
In Cascading, I can specify a PartitionTap, tell it the first three items in the path will be year, month, and day, and I'm off to the races.
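Roughly, that looks like this (a sketch, not my actual job; the TextDelimited scheme below is a stand-in for the parquet scheme I really use):

import cascading.scheme.hadoop.TextDelimited
import cascading.tap.hadoop.{Hfs, PartitionTap}
import cascading.tap.partition.DelimitedPartition
import cascading.tuple.Fields

// Parent tap over the data; in reality the scheme is parquet-cascading's.
val parent = new Hfs(new TextDelimited(new Fields("payload"), "\t"), "hdfs:///path/to/data")

// The first three directory levels under the parent become the
// year, month, and day fields of each tuple.
val tap = new PartitionTap(parent, new DelimitedPartition(new Fields("year", "month", "day"), "/"))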
But I cannot figure out how to achieve a similar effect in SparkSQL. Is it possible to tell SparkSQL about this partitioning scheme, so that year, month, and day come back as columns I can query against?
(I could walk the directory structure myself using the FileSystem API and assemble the paths by hand, but I'd really rather not; a sketch of what that entails is at the bottom of this question.)

(My parquet data contains nested structures, which SparkSQL can read and interact with beautifully as long as I let it do so automagically. If I try to manually specify the schema, it cannot seem to handle the nested structures.)
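For concreteness, the manual fallback mentioned above would look something like this (a sketch, reusing the sqlContext from the earlier snippet; in practice the (year, month, day) triples would come from a FileSystem listing rather than a hard-coded list):

import org.apache.spark.sql.functions.lit

// Read each day's directory separately and bolt the date back on as columns.
val days = Seq((2015, 4, 29), (2015, 4, 30))
val perDay = days.map { case (y, m, d) =>
  sqlContext.read
    .parquet(f"/data/$y/$m%02d/$d%02d")
    .withColumn("year", lit(y))
    .withColumn("month", lit(m))
    .withColumn("day", lit(d))
}
val df = perDay.reduce(_ unionAll _)

At least this keeps the schema inference automatic, so the nested structures survive -- but it's exactly the kind of bookkeeping I'd hoped SparkSQL would do for me.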