Question

Hi, I need a Lambda function that reads and writes Parquet files and saves them to S3. I tried to build a deployment package containing the libraries I need in order to use pyarrow, but I get an initialization error for the cffi library:

module initialization error: [Errno 2] No such file or directory: '/var/task/__pycache__/_cffi__x762f05ffx6bf5342b.c'

Can I even create Parquet files with AWS Lambda? Has anyone had a similar problem?

I would like to do this with pyarrow, or by some other method; I just need to be able to read and write Parquet files compressed with snappy.

Answer 1

I believe the problem is that the snappy shared object file is missing from the package deployed to Lambda.

https://github.com/andrix/python-snappy/issues/52#issuecomment-342364113

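I ran into the same error when trying to encode with snappy from a Lambda function (which is invoked from a directory without write permissions). Including libsnappy.so.1 in my zip file resolved it.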
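One way to confirm that the shared object actually made it into the package is to look for it at startup. The following is a rough sketch using only the standard library; the helper names find_bundled_library and try_load are hypothetical, and it assumes the runtime exposes the unpacked package directory through the LAMBDA_TASK_ROOT environment variable, falling back to the current directory elsewhere.

```python
import ctypes
import os


def find_bundled_library(name="libsnappy.so.1", root=None):
    """Return the path of `name` somewhere under `root`, or None if absent."""
    # Assumption: LAMBDA_TASK_ROOT points at the unpacked deployment package.
    root = root or os.environ.get("LAMBDA_TASK_ROOT", os.getcwd())
    for dirpath, _dirnames, filenames in os.walk(root):
        if name in filenames:
            return os.path.join(dirpath, name)
    return None


def try_load(path):
    """Try to load the shared library at `path`; return True on success."""
    try:
        ctypes.CDLL(path)
        return True
    except OSError:
        # Raised when the file is not a loadable library for this platform.
        return False


if __name__ == "__main__":
    lib_path = find_bundled_library()
    if lib_path is None:
        print("libsnappy.so.1 was not found in the package")
    else:
        status = "loads" if try_load(lib_path) else "fails to load"
        print(lib_path, status)
```

A file that is present but fails to load usually means it was built for a different platform than the Lambda runtime.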

Answer 2

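To include the dependencies needed for Snappy compression/decompression, see Paul Zielinski's answer.

As for writing to (and reading from) S3 itself, you also need s3fs (packaged in the zip as well). Add the following to your code: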

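import pyarrow.parquet as pq
import s3fs

# With no arguments, s3fs falls back to the default AWS credential chain.
s3 = s3fs.S3FileSystem()

# `table` is the pyarrow.Table to store; write_table uses snappy by default.
with s3.open('s3://your-bucket/path/to/test.parquet', 'wb') as f:
    pq.write_table(table, f)

# Read the same object back into a pyarrow.Table.
with s3.open('s3://your-bucket/path/to/test.parquet', 'rb') as f:
    table = pq.read_table(f)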

A note on your use of table.to_pandas(): I don't believe this method works in place on the table, so unless you assign the result (df = table.to_pandas()), the call is useless.

Finally, you can also read a complete (partitioned) dataset directly from S3 using the following:

dataset = pq.ParquetDataset(
    'your-bucket/path/to/your/dataset',
    filesystem=s3)
table = dataset.read()

Here path/to/your/dataset is the path to the directory containing your dataset.

Thanks to Wes McKinney and DrChrisLevy (GitHub) for this last solution, provided in ARROW-1213!

Reading/writing Parquet files with AWS Lambda?

2 answers: