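I ran some performance benchmarks on 6 GB of data stored in different formats: CSV + Gzip, ParquetAvro + Gzip, and Parquet + Gzip. I ran a simple Pig script that only loads the data and writes it to an empty output. The tests were run on the same machine under the same conditions. The results are as follows:
I see that AvroParquet performs poorly. Is this expected? Has anyone run into the same issue?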
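---------------------------------UPDATE---------------------------------
For the same blocks, InternalParquetRecordReader takes more time to read the records into memory with AvroParquet. However, that is only a small part of the job. Here is a sample of what I see in the logs:
Parquet: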
Jul 17, 2015 7:08:49 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 163 ms. row count = 21493464
Jul 17, 2015 7:09:01 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 147 ms. row count = 19451260
Jul 17, 2015 7:09:10 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 97 ms. row count = 13314248
Jul 17, 2015 7:09:15 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 35 ms. row count = 46992661
Jul 17, 2015 7:09:34 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 78 ms. row count = 12035159
Jul 17, 2015 7:09:40 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 28 ms. row count = 17718888
Jul 17, 2015 7:09:49 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 45 ms. row count = 19528965
Jul 17, 2015 7:09:55 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 76 ms. row count = 55428025
vs. AvroParquet:
Jul 17, 2015 6:31:26 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 2049 ms. row count = 21493464
Jul 17, 2015 6:31:58 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 1621 ms. row count = 19451260
Jul 17, 2015 6:32:28 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 1467 ms. row count = 13314248
Jul 17, 2015 6:32:45 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 666 ms. row count = 46992661
Jul 17, 2015 6:33:45 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 933 ms. row count = 12035159
Jul 17, 2015 6:34:03 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 963 ms. row count = 17718888
Jul 17, 2015 6:34:32 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 2242 ms. row count = 19528965
Jul 17, 2015 6:34:56 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 507 ms. row count = 55428025
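
To put numbers on the comparison, the "block read in memory" lines can be totaled per run. The following is only a rough sketch in Python: the function names `summarize` and `seconds_of_day` are made up for illustration, the regular expression assumes the exact log line format shown above, and the wall-clock span assumes all lines fall on the same day.

```python
import re

# Matches the InternalParquetRecordReader lines exactly as they appear in the
# excerpts above (time of day, AM/PM, read time in ms, row count).
BLOCK_RE = re.compile(
    r"(?P<h>\d{1,2}):(?P<m>\d{2}):(?P<s>\d{2}) (?P<ampm>[AP]M) INFO: "
    r"org\.apache\.parquet\.hadoop\.InternalParquetRecordReader: "
    r"block read in memory in (?P<ms>\d+) ms\. row count = (?P<rows>\d+)"
)


def seconds_of_day(h, m, s, ampm):
    """Convert a 12-hour clock time to seconds since midnight."""
    hour24 = h % 12 + (12 if ampm == "PM" else 0)
    return hour24 * 3600 + m * 60 + s


def summarize(log_text):
    """Total the block read times and measure the span between log lines.

    Assumption: every matched line belongs to the same day, so the span is
    simply the latest time of day minus the earliest one.
    """
    read_ms = []
    stamps = []
    for match in BLOCK_RE.finditer(log_text):
        read_ms.append(int(match.group("ms")))
        stamps.append(
            seconds_of_day(
                int(match.group("h")),
                int(match.group("m")),
                int(match.group("s")),
                match.group("ampm"),
            )
        )
    span_s = max(stamps) - min(stamps) if stamps else 0
    return {"blocks": len(read_ms), "read_ms": sum(read_ms), "span_s": span_s}


# First two Parquet lines from the excerpt above.
sample = """\
Jul 17, 2015 7:08:49 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 163 ms. row count = 21493464
Jul 17, 2015 7:09:01 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 147 ms. row count = 19451260
"""
print(summarize(sample))
# {'blocks': 2, 'read_ms': 310, 'span_s': 12}
```

Applied to the full excerpts above, this gives a total of 669 ms of block read time for Parquet against 10448 ms for AvroParquet, while the time between the first and last logged block is 66 s and 210 s respectively. So the slower block reads account for only part of the gap, which matches the remark that they are a small part of the job.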