ibm_boto3 compatibility issue with scikit-learn on Mac OS

Asked: 2018-05-04 20:08:52

Tags: python scikit-learn object-storage ibm-cloud-storage

I have a Python 3.6 application that uses scikit-learn, deployed to IBM Cloud (Cloud Foundry). It works fine. My local development environment is Mac OS High Sierra.

Recently, I added IBM Cloud Object Storage functionality (ibm_boto3) to the app. The COS functionality itself works fine: I can upload, download, list, and delete objects without trouble using the ibm_boto3 library.
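For context, the COS calls in question look roughly like this. A minimal sketch, not the app's actual code; the endpoint, bucket, and key names are placeholders, and it assumes HMAC-style credentials:

import ibm_boto3  # IBM's fork of boto3 for Cloud Object Storage
from ibm_botocore.client import Config

# Placeholder endpoint and HMAC credentials -- substitute real values.
cos = ibm_boto3.client('s3',
                       endpoint_url='https://s3-api.us-geo.objectstorage.softlayer.net',
                       aws_access_key_id='<access key>',
                       aws_secret_access_key='<secret key>',
                       config=Config(signature_version='s3v4'))

cos.put_object(Bucket='demo', Key='demo-data', Body=b'hello')         # upload
print(cos.list_objects(Bucket='demo').get('Contents', []))            # list
data = cos.get_object(Bucket='demo', Key='demo-data')['Body'].read()  # download
cos.delete_object(Bucket='demo', Key='demo-data')                     # delete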

Strangely, the part of the app that uses scikit-learn now freezes up.

If I comment out the ibm_boto3 import statements (and the corresponding code), the scikit-learn code works fine.
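A minimal sketch of the kind of script that isolates the behavior; the random data and K-means parameters here are placeholders, and toggling the single import is the variable being tested:

import ibm_boto3  # commenting out this one import lets the fit() below finish
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(1000, 10)
# The parallel code path (n_jobs=-1) is where the freeze shows up for us.
KMeans(n_clusters=4, n_jobs=-1).fit(X)
print('finished clustering')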

More perplexingly, the issue only happens on the local development machine running OS X. When the app is deployed to IBM Cloud, it works fine: both scikit-learn and ibm_boto3 work well side by side.

Our only hypothesis at this point is that the ibm_boto3 library somehow surfaces a known issue in scikit-learn (see this - the parallel version of the K-means algorithm breaks when numpy uses Accelerate on OS X). Note that we only hit this issue once we add ibm_boto3 to the project.
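One way to test the numpy side of this hypothesis is to check which BLAS/LAPACK backend the local numpy was built against; on OS X a stock build typically reports Apple's Accelerate/vecLib framework:

import numpy as np

# Prints the BLAS/LAPACK build configuration; look for 'accelerate' or
# 'veclib' entries to confirm numpy is using the Accelerate framework.
np.__config__.show()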

However, we need to be able to test on localhost before deploying to IBM Cloud. Are there any known compatibility issues between ibm_boto3 and scikit-learn on Mac OS?

Any suggestions on how we can avoid this on our dev machines?

Cheers.

1 answer:

Answer 0 (score: 1)

There aren't any known compatibility issues so far. :)

At one point there were some issues with the vanilla SSL libraries that ship with OSX, but if you're able to read and write data, that isn't the problem here.

Are you using HMAC credentials? If so, I'm curious whether the behavior persists if you use the original boto3 library instead of the IBM fork.

Here's a simple example of how you can use pandas with the original boto3:

import boto3  # package used to connect to IBM COS using the S3 API
import io  # python package used to stream data
import pandas as pd  # lightweight data analysis package

access_key = '<access key>'
secret_key = '<secret key>'
pub_endpoint = 'https://s3-api.us-geo.objectstorage.softlayer.net'
pvt_endpoint = 'https://s3-api.us-geo.objectstorage.service.networklayer.com'
bucket = 'demo'  # the bucket holding the objects being worked on.
object_key = 'demo-data'  # the name of the data object being analyzed.
result_key = 'demo-data-results'  # the name of the output data object.


# First, we need to open a session and create a client that can connect to IBM COS.
# This client needs to know where to connect, the credentials to use,
# and what signature protocol to use for authentication. The endpoint
# can be specified to be public or private.
cos = boto3.client('s3', endpoint_url=pub_endpoint,
                   aws_access_key_id=access_key,
                   aws_secret_access_key=secret_key,
                   region_name='us',
                   config=boto3.session.Config(signature_version='s3v4'))

# Since we've already uploaded the dataset to be worked on into cloud storage,
# now we just need to identify which object we want to use. This returns a
# dictionary of response metadata along with a streaming body for the object.
obj = cos.get_object(Bucket=bucket, Key=object_key)

# Now, because this is all REST API based, the actual contents of the file are
# transported in the response body, so we need to pull out the data stream
# containing the actual CSV file we want to analyze.
data = obj['Body'].read()

# Now we can read that data stream into a pandas dataframe.
df = pd.read_csv(io.BytesIO(data))

# This is just a trivial example, but we'll take that dataframe and just
# create a JSON document that contains the mean values for each column.
output = df.mean(axis=0, numeric_only=True).to_json()

# Now we can write that JSON file to COS as a new object in the same bucket.
cos.put_object(Bucket=bucket, Key=result_key, Body=output)
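If you confirm the behavior with vanilla boto3, swapping back to the IBM fork should need only the import changed, since ibm_boto3 keeps the same S3 client surface. A sketch under that assumption, reusing the names defined above:

import ibm_boto3
from ibm_botocore.client import Config

cos_ibm = ibm_boto3.client('s3', endpoint_url=pub_endpoint,
                           aws_access_key_id=access_key,
                           aws_secret_access_key=secret_key,
                           region_name='us',
                           config=Config(signature_version='s3v4'))

# Read the result object back to verify the round trip through COS.
print(cos_ibm.get_object(Bucket=bucket, Key=result_key)['Body'].read())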