Question

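After some searching, I failed to find a thorough comparison of fastparquet and pyarrow.

I found this blog post (a basic comparison of speeds).

I also found a GitHub discussion claiming that files created with fastparquet are not supported by AWS Athena (by the way, is that still the case?).

When and why would I use one over the other? What are the major advantages and disadvantages?

My specific use case is processing data with dask, writing it to S3, and then reading and analyzing it with AWS Athena.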

Answer 1

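Since the question lacks concrete criteria, and I came here looking for a good "default choice", I want to point out that the default Parquet engine pandas uses for DataFrame objects is pyarrow, with fastparquet as the fallback (see the pandas docs).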
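
To make the "default choice" concrete, here is a minimal sketch that mimics the lookup order of `engine="auto"` using only the standard library. `pick_parquet_engine` is a hypothetical helper written for illustration, not a pandas function, and it only checks whether a package is installed (pandas itself also checks versions).

```python
from importlib.util import find_spec

# Order in which pandas' engine="auto" looks for a Parquet backend.
PREFERRED_ENGINES = ("pyarrow", "fastparquet")


def pick_parquet_engine(preferred=PREFERRED_ENGINES):
    """Return the first installed engine name, or None if none is installed."""
    for name in preferred:
        if find_spec(name) is not None:
            return name
    return None


engine = pick_parquet_engine()
print(f"Parquet engine that would be picked: {engine}")
```

The result can be passed explicitly, for example `df.to_parquet(path, engine=engine)`, which makes the choice visible in the code instead of depending on what happens to be installed.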

Answer 2

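I would point out that the author of the speed comparison is also the author of pyarrow :) I can speak about the fastparquet case.

From your point of view, the most important thing to know is compatibility. Athena is not one of the test targets of fastparquet (or pyarrow), so you should test thoroughly before making your choice. There are a number of options you may want to invoke (docs) for datetime representation, nulls, and types, and these may be important to you.

Writing to S3 using dask is certainly a test case for fastparquet, and I believe pyarrow should have no problem with that either.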
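
As a rough illustration of the options mentioned above, the sketch below collects the `fastparquet.write` keyword arguments that control timestamps, nulls, and object encoding. The helper name `fastparquet_write_options` and the chosen values are assumptions for demonstration only; which values Athena actually needs is exactly what should be verified by testing.

```python
def fastparquet_write_options(int96_timestamps=True, nullable=True):
    """Build keyword arguments for fastparquet.write() (illustrative values)."""
    return {
        "compression": "SNAPPY",
        # 'int96' is the legacy timestamp layout; 'int64' is fastparquet's default.
        "times": "int96" if int96_timestamps else "int64",
        # Whether columns are written as nullable.
        "has_nulls": nullable,
        # Store Python str objects as UTF-8 strings instead of inferring the type.
        "object_encoding": "utf8",
    }


options = fastparquet_write_options()
print(options)

# Intended use (requires fastparquet and a pandas DataFrame named df):
#   import fastparquet
#   fastparquet.write("data.parq", df, **options)
```

Keeping the options in one place makes it easy to write a small test file with each variant and check how Athena reads it.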
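
A minimal sketch of the dask-to-S3 write, assuming a dask DataFrame and an installed `s3fs` (needed for `s3://` URLs). The function name `write_partitions_to_s3` and the bucket path in the usage comment are hypothetical; the `engine` argument is where the fastparquet/pyarrow choice is made.

```python
def write_partitions_to_s3(ddf, s3_url, engine="pyarrow"):
    """Write a dask DataFrame as a directory of Parquet files.

    ddf is expected to be a dask.dataframe.DataFrame; each partition
    becomes one Parquet file under s3_url.
    """
    return ddf.to_parquet(
        s3_url,
        engine=engine,
        compression="snappy",
        write_index=False,  # do not write the index as a column
    )


# Intended use (requires dask, s3fs, and the chosen engine):
#   import dask.dataframe as dd
#   ddf = dd.from_pandas(pandas_df, npartitions=4)
#   write_partitions_to_s3(ddf, "s3://mydata-aws-bucket/dataset/")
```

Pointing an Athena table's `LOCATION` at the same prefix then lets Athena read all the files in that directory as one table.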

Answer 3

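I used both fastparquet and pyarrow to convert protobuf data to Parquet and to query it in S3 using Athena. Both worked. However, in my use case, which is a Lambda function, the packaged zip file has to be lightweight, so I went with fastparquet. (The fastparquet library was only about 1.1 MB, while the pyarrow library was 176 MB, and the Lambda package limit is 250 MB.)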
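
The size argument can be checked with simple arithmetic. The sketch below uses only the figures quoted above; `remaining_budget_mb` is a hypothetical helper, and real package sizes vary by version and platform. Note that fastparquet also depends on pandas and numpy, which count against the same limit.

```python
LAMBDA_LIMIT_MB = 250  # package limit quoted in the answer

# Library sizes quoted in the answer, in megabytes.
ENGINE_SIZES_MB = {"fastparquet": 1.1, "pyarrow": 176}


def remaining_budget_mb(package_sizes_mb, limit_mb=LAMBDA_LIMIT_MB):
    """Return how many megabytes remain after the given packages."""
    return limit_mb - sum(package_sizes_mb.values())


for engine_name, size_mb in ENGINE_SIZES_MB.items():
    left = remaining_budget_mb({engine_name: size_mb})
    print(f"{engine_name}: {left:.1f} MB left for everything else")
```

With pyarrow, most of the budget is gone before the function's own code and other dependencies are added, which is the trade-off the answer describes.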

Answer 4

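I just used fastparquet for a case where I pull data out of Elasticsearch, store it in S3, and query it with Athena, with no issues at all.

I used the following to store a dataframe in S3 as a Parquet file: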

import s3fs
import fastparquet as fp
import pandas as pd
import numpy as np

s3 = s3fs.S3FileSystem()
myopen = s3.open
s3bucket = 'mydata-aws-bucket/'

# random dataframe for demo
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

# Object key inside the bucket; fastparquet opens it through s3fs via open_with
parqKey = s3bucket + "datafile.parq.snappy"
fp.write(parqKey, df, compression='SNAPPY', open_with=myopen)

My table in Athena looks like this:

CREATE EXTERNAL TABLE IF NOT EXISTS myanalytics_parquet (
  `column1` string,
  `column2` int,
  `column3` DOUBLE,
  `column4` int,
  `column5` string
 )
STORED AS PARQUET
LOCATION 's3://mydata-aws-bucket/'
tblproperties ("parquet.compress"="SNAPPY")

Answer 5

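This question may be a bit old, but I happened to be working on the same problem and found this benchmark: https://wesmckinney.com/blog/python-parquet-update/. According to it, pyarrow is faster than fastparquet, so it is little wonder that it is the default engine used in dask.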

A comparison between fastparquet and pyarrow?
