Python: reading Parquet files stored on S3 with petastorm generates connection warnings

Date: 2019-05-14 17:14:54

Tags: python tensorflow urllib3 petastorm

I have a Tensorflow model that I want to feed with Parquet files stored on S3. I am using petastorm to query these files from S3, and the query result is exposed as a Tensorflow dataset via petastorm.tf_utils.make_petastorm_dataset.

Here is the code I am using (largely inspired by this thread: Tensorflow Dataset API: input pipeline with parquet files):

import s3fs
from pyarrow.filesystem import S3FSWrapper
from petastorm.reader import Reader
from petastorm.tf_utils import make_petastorm_dataset

dataset_url = "analytics.xxx.xxx"  # S3 bucket name

fs = s3fs.S3FileSystem()
wrapped_fs = S3FSWrapper(fs)

with Reader(pyarrow_filesystem=wrapped_fs, dataset_path=dataset_url) as reader:
    dataset = make_petastorm_dataset(reader)

This works fine, except that it generates 20+ lines of connection warnings:

W0514 18:56:42.779965 140231344908032 connectionpool.py:274] Connection pool is full, discarding connection: s3.eu-west-1.amazonaws.com
W0514 18:56:42.782773 140231311337216 connectionpool.py:274] Connection pool is full, discarding connection: s3.eu-west-1.amazonaws.com
W0514 18:56:42.854569 140232468973312 connectionpool.py:274] Connection pool is full, discarding connection: s3.eu-west-1.amazonaws.com
W0514 18:56:42.868761 140231328122624 connectionpool.py:274] Connection pool is full, discarding connection: s3.eu-west-1.amazonaws.com
W0514 18:56:42.885518 140230816429824 connectionpool.py:274] Connection pool is full, discarding connection: s3.eu-west-1.amazonaws.com
...

According to this thread, urllib3 connectionpool - Connection pool is full, discarding connection, it is clearly related to urllib3, but I haven't found a way to get rid of these warnings.

Has anyone run into this problem?

1 answer:

Answer 0 (score: 0)

Answered on GitHub: https://github.com/uber/petastorm/issues/376. Pass connection-pool settings through to boto3 and increase max_pool_connections:

fs = s3fs.S3FileSystem(config_kwargs={'max_pool_connections': 50})
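Why this helps: urllib3 keeps a bounded per-host pool of reusable connections (10 by default), and when a worker thread returns a connection to an already-full pool, urllib3 discards it and logs exactly the warning shown above. The effect can be sketched with a toy stdlib-only model (illustrative only; this is not the real urllib3 or s3fs API, and the numbers below are assumptions):

```python
import queue

def discarded_connections(pool_size, returned_connections):
    """Toy model of a bounded connection pool: connections returned to a
    full pool are dropped, mirroring urllib3's 'discarding connection'
    warning. Not the real urllib3 implementation."""
    pool = queue.LifoQueue(maxsize=pool_size)
    discarded = 0
    for _ in range(returned_connections):
        try:
            pool.put(object(), block=False)  # try to return the connection
        except queue.Full:
            discarded += 1  # pool full: connection is discarded (warning logged)
    return discarded

# With urllib3's default pool of 10 and, say, 30 concurrent reader
# threads returning connections, 20 get discarded -> 20 warnings:
print(discarded_connections(10, 30))  # 20
# Sizing the pool above the worker count (max_pool_connections=50)
# leaves room for every connection, so nothing is discarded:
print(discarded_connections(50, 30))  # 0
```

In other words, max_pool_connections should be at least as large as the number of threads petastorm uses to read from S3 concurrently.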