Question

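I just want to know what people think about reading from Hive versus reading directly from a .csv, .txt, .ORC, or .parquet file. Assuming the underlying Hive table is an external table with the same file format, would you rather read from the Hive table or from the underlying files themselves, and why?

Mike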

Answer 1

tl;dr: I would read it directly from the parquet files.

I am using Spark 1.5.2 and Hive 1.2.1. For a table of 50 million rows x 100 columns, some timings I recorded are:

val dffile = sqlContext.read.parquet("/path/to/parquets/*.parquet")
val dfhive = sqlContext.table("db.table")

dffile count -> 0.38s; dfhive count -> 8.99s

dffile sum(col) -> 0.98s; dfhive sum(col) -> 8.10s

dffile substring(col) -> 2.63s; dfhive substring(col) -> 7.77s

dffile where(col = value) -> 82.59s; dfhive where(col = value) -> 157.64s

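Note that these were measured with an older version of Hive and an older version of Spark, so I cannot comment on how the speed of the two read mechanisms may have improved since then.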
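For anyone who wants to repeat this kind of comparison on their own versions, here is a minimal sketch of a wall-clock timing helper. The name `timed` and the `Timing` object are hypothetical and not part of the original measurements; the answer does not say how the timings above were taken.

```scala
object Timing {
  // Runs `body` once, prints "<label> -> <seconds>s", and returns the result
  // together with the elapsed wall-clock time in seconds.
  def timed[A](label: String)(body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body // by-name argument is evaluated here, inside the timed region
    val seconds = (System.nanoTime() - start) / 1e9
    println(f"$label -> $seconds%.2fs")
    (result, seconds)
  }
}
```

With the two DataFrames defined above, it could be used as `Timing.timed("dffile count")(dffile.count())` and `Timing.timed("dfhive count")(dfhive.count())`. Because Spark evaluates lazily, the timed expression has to end in an action such as `count()` or `collect()`; timing a bare transformation only measures building the query plan.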

Answer 2

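As I understand it, even though ORC is generally better suited to flat structures and Parquet to nested ones, Spark is optimized for Parquet. It is therefore recommended to use that format with Spark.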

Furthermore, the metadata for all the tables you read from Parquet will be stored in Hive anyway. This is from the Spark documentation: "Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table conversion is enabled, metadata of those converted tables are also cached. If these tables are updated by Hive or other external tools, you need to refresh them manually to ensure consistent metadata."
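The manual refresh mentioned in that passage can be sketched as follows for Spark 1.x. This assumes the context is a `HiveContext` (where `refreshTable` is defined in that line of releases); the helper name `countAfterRefresh` is hypothetical, and `"db.table"` is simply the table name used in the first answer.

```scala
import org.apache.spark.sql.hive.HiveContext

// Invalidates Spark's cached Parquet metadata for `table`, then reads it again.
// Intended for the case where Hive or another external tool has changed the
// table's files since Spark last read it.
def countAfterRefresh(hiveContext: HiveContext, table: String): Long = {
  hiveContext.refreshTable(table)
  hiveContext.table(table).count()
}
```

The conversion the documentation refers to is controlled by the `spark.sql.hive.convertMetastoreParquet` setting, which is enabled by default.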

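I tend to convert data to Parquet as soon as possible and store it in Alluxio, backed by HDFS. This gives me better performance for read/write operations and limits how much I need to use the cache.

I hope this helps.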
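A minimal sketch of that conversion step, using the DataFrame reader/writer API available since Spark 1.4. The function name `convertToParquet` and any host names or paths passed to it are hypothetical, and writing to an `alluxio://` URI assumes the Alluxio client library is on Spark's classpath.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

// Copies an existing table to Parquet files under `targetPath` and returns
// a DataFrame that reads from the new files.
def convertToParquet(sqlContext: SQLContext,
                     sourceTable: String,
                     targetPath: String): DataFrame = {
  sqlContext.table(sourceTable)
    .write
    .mode("overwrite") // replaces any existing data at targetPath
    .parquet(targetPath)
  sqlContext.read.parquet(targetPath)
}
```

A call might look like `convertToParquet(sqlContext, "db.table", "alluxio://alluxio-master:19998/warehouse/table_parquet")`, where the host and path are placeholders; an `hdfs://` path works the same way if Alluxio is not in use.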

Is it better for Spark to select from Hive or from files?
