Is AvroParquet slow to read?

Time: 2015-07-17 17:18:15

Tags: performance avro parquet

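I ran some performance benchmarks on 6 GB of data stored in three different formats: CSV + Gzip, AvroParquet + Gzip, and Parquet + Gzip. Each run used a simple Pig script that only loads the data and writes an empty output. All tests ran under the same conditions on the same machine. Here are the results: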

  • CSV + Gzip: 23 minutes
  • AvroParquet + Gzip: 35 minutes
  • Parquet + Gzip: 11 minutes

AvroParquet performs noticeably worse than the other two formats. Is this expected? Has anyone run into the same problem?

--------------------------------- UPDATE ---------------------------------

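For the same blocks, InternalParquetRecordReader takes longer to read the records into memory with AvroParquet. However, that accounts for only a small part of the total work. Here is a sample of what I see in the logs: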

Parquet:

Jul 17, 2015 7:08:49 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 163 ms. row count = 21493464
Jul 17, 2015 7:09:01 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 147 ms. row count = 19451260
Jul 17, 2015 7:09:10 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 97 ms. row count = 13314248
Jul 17, 2015 7:09:15 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 35 ms. row count = 46992661
Jul 17, 2015 7:09:34 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 78 ms. row count = 12035159
Jul 17, 2015 7:09:40 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 28 ms. row count = 17718888
Jul 17, 2015 7:09:49 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 45 ms. row count = 19528965
Jul 17, 2015 7:09:55 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 76 ms. row count = 55428025

VS AvroParquet:

Jul 17, 2015 6:31:26 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 2049 ms. row count = 21493464
Jul 17, 2015 6:31:58 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 1621 ms. row count = 19451260
Jul 17, 2015 6:32:28 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 1467 ms. row count = 13314248
Jul 17, 2015 6:32:45 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 666 ms. row count = 46992661
Jul 17, 2015 6:33:45 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 933 ms. row count = 12035159
Jul 17, 2015 6:34:03 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 963 ms. row count = 17718888
Jul 17, 2015 6:34:32 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 2242 ms. row count = 19528965
Jul 17, 2015 6:34:56 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 507 ms. row count = 55428025
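As a rough check on how much of the run time these block reads represent, the logged durations can be summed. This is only a minimal Python sketch: the names `BLOCK_READ`, `summarize`, `PARQUET_LOG` and `AVRO_PARQUET_LOG` are made up for illustration, the log lines are trimmed to the message part, and the totals cover only the sampled lines above, not the whole job.

```python
import re

# Matches the InternalParquetRecordReader message shown in the log samples.
BLOCK_READ = re.compile(r"block read in memory in (\d+) ms\. row count = (\d+)")


def summarize(log_text):
    """Return (total_ms, total_rows) over every block-read line in log_text."""
    total_ms = 0
    total_rows = 0
    for ms, rows in BLOCK_READ.findall(log_text):
        total_ms += int(ms)
        total_rows += int(rows)
    return total_ms, total_rows


# Message parts of the sampled log lines (timestamps and class name trimmed).
PARQUET_LOG = """
block read in memory in 163 ms. row count = 21493464
block read in memory in 147 ms. row count = 19451260
block read in memory in 97 ms. row count = 13314248
block read in memory in 35 ms. row count = 46992661
block read in memory in 78 ms. row count = 12035159
block read in memory in 28 ms. row count = 17718888
block read in memory in 45 ms. row count = 19528965
block read in memory in 76 ms. row count = 55428025
"""

AVRO_PARQUET_LOG = """
block read in memory in 2049 ms. row count = 21493464
block read in memory in 1621 ms. row count = 19451260
block read in memory in 1467 ms. row count = 13314248
block read in memory in 666 ms. row count = 46992661
block read in memory in 933 ms. row count = 12035159
block read in memory in 963 ms. row count = 17718888
block read in memory in 2242 ms. row count = 19528965
block read in memory in 507 ms. row count = 55428025
"""

parquet_ms, parquet_rows = summarize(PARQUET_LOG)
avro_ms, avro_rows = summarize(AVRO_PARQUET_LOG)

print("Parquet:     %d ms for %d rows" % (parquet_ms, parquet_rows))
print("AvroParquet: %d ms for %d rows" % (avro_ms, avro_rows))
print("Slowdown on block reads: %.1fx" % (avro_ms / parquet_ms))
```

For the sampled lines this gives 669 ms for Parquet and 10448 ms for AvroParquet, so the block reads are roughly 15 times slower but still add up to only about 10 seconds. That is consistent with the observation that reading blocks into memory does not by itself explain the difference between 11 and 35 minutes.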

0 answers:

No answers