I am trying to create a Python scalar user-defined function (UDF) in an AWS Redshift database. The UDF wraps the following Python code:
CREATE or replace library nltk language plpythonu from 's3://xxx/dev/python-libraries/nltk-3.2.1.zip'
credentials 'aws_access_key_id=xxx;aws_secret_access_key=yyy' region as 'eu-west-1';
CREATE or replace library textblob language plpythonu from 's3://xxx/dev/python-libraries/textblob-0.15.1-py2.py3-none-any.zip'
credentials 'aws_access_key_id=xxx;aws_secret_access_key=yyy' region as 'eu-west-1';
CREATE or replace FUNCTION f_sentiment_polarity (comment varchar(1000)) RETURNS float IMMUTABLE as $$
from textblob import TextBlob
return TextBlob(comment).sentiment.polarity
$$ LANGUAGE plpythonu;
SELECT f_sentiment_polarity('this would be very useful if the corpora were loaded');
f_sentiment_polarity
--------------------
0
The select statement returns 0.
When I run the same Python code in my local environment (Python 2.7 on Windows, with NLTK v3.2.5), I get 0.39:
Python 2.7.10 (default, May 23 2015, 09:44:00) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from textblob import TextBlob
>>> TextBlob('this would be very useful if the corpora were loaded').sentiment.polarity
0.39
>>>
I think this is because the various NLTK corpora have not been loaded in the AWS Redshift Python environment. Creating another Redshift UDF, as follows, seems to confirm this:
CREATE or replace FUNCTION f_num_brown_words () RETURNS int IMMUTABLE as $$
from nltk.corpus import brown
return len(brown.words())
$$ LANGUAGE plpythonu;
select f_num_brown_words();
ERROR: XX000: LookupError:
**********************************************************************
Resource u'corpora/brown' not found. Please use the NLTK
Downloader to obtain the resource: >>> nltk.download()
Searched in:
- "'/'/nltk_data"
- '/usr/shar
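To make the failure mode concrete, here is a minimal sketch of the kind of directory search the error message describes: each search path is checked in turn for the resource, and a LookupError is raised when none contains it. This is a simplified stand-in, not NLTK's actual implementation; the `find_resource` helper and the temporary demo directory are hypothetical names introduced only for illustration.

```python
import os
import tempfile


def find_resource(resource, search_paths):
    """Return the first existing location of `resource` under `search_paths`.

    Simplified stand-in for a data-directory lookup (an assumption, not
    NLTK's real code). Raises LookupError listing the searched directories
    when the resource is not found in any of them.
    """
    for base in search_paths:
        candidate = os.path.join(base, *resource.split("/"))
        if os.path.exists(candidate):
            return candidate
    raise LookupError(
        "Resource %r not found. Searched in: %s"
        % (resource, ", ".join(search_paths))
    )


# Hypothetical demo: a temporary data directory that contains corpora/brown.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "corpora", "brown"))

# Succeeds because the directory exists under one of the search paths.
found = find_resource("corpora/brown", [demo_root])

# Fails the same way the UDF does when no search path holds the resource.
try:
    find_resource("corpora/missing", [demo_root])
    missing_error = None
except LookupError as exc:
    missing_error = str(exc)
```

If the lookup in Redshift behaves along these lines, the corpus data would need to exist under one of the searched directories inside the UDF's Python environment, which is what the question below asks about.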
Question: Is there a way to load the NLTK corpora into the AWS Redshift Python environment so that my UDFs work correctly?