Can NLTK data be installed in an AWS Redshift environment?

Asked: 2018-01-28 13:15:45

Tags: python amazon-web-services nltk amazon-redshift textblob

I am trying to create a Python scalar user-defined function (UDF) in an AWS Redshift database. The UDF wraps the following Python code:

CREATE or replace library nltk language plpythonu from 's3://xxx/dev/python-libraries/nltk-3.2.1.zip'
credentials 'aws_access_key_id=xxx;aws_secret_access_key=yyy' region as 'eu-west-1';

CREATE or replace library textblob language plpythonu from 's3://xxx/dev/python-libraries/textblob-0.15.1-py2.py3-none-any.zip'
credentials 'aws_access_key_id=xxx;aws_secret_access_key=yyy' region as 'eu-west-1';

CREATE or replace FUNCTION f_sentiment_polarity (comment varchar(1000)) RETURNS float IMMUTABLE as $$
from textblob import TextBlob
return TextBlob(comment).sentiment.polarity
$$ LANGUAGE plpythonu;

SELECT f_sentiment_polarity('this would be very useful if the corpora were loaded');

f_sentiment_polarity
--------------------
                   0

The SELECT statement returns 0.

When I run the same Python code in my local environment (Python 2.7 on Windows with NLTK v3.2.5), I get 0.39:

Python 2.7.10 (default, May 23 2015, 09:44:00) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from textblob import TextBlob
>>> TextBlob('this would be very useful if the corpora were loaded').sentiment.polarity
0.39
>>>

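I think this is because the various NLTK corpora have not been loaded into the AWS Redshift Python environment. Creating another Redshift UDF, as follows, seems to confirm this: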

CREATE or replace FUNCTION f_num_brown_words () RETURNS int IMMUTABLE as $$
from nltk.corpus import brown
return len(brown.words())
$$ LANGUAGE plpythonu;

select f_num_brown_words();

ERROR: XX000: LookupError: 
**********************************************************************
  Resource u'corpora/brown' not found.  Please use the NLTK
  Downloader to obtain the resource:  >>> nltk.download()
  Searched in:
    - "'/'/nltk_data"
    - '/usr/shar
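The "Searched in" list in this error comes from NLTK trying each directory in `nltk.data.path` in turn. As a rough illustration only, the sketch below mimics that search with plain directory checks. The helper name `find_resource` is hypothetical, and this is a simplification: NLTK's real lookup also handles zipped resources, which this sketch ignores.

```python
import os


def find_resource(resource, search_dirs):
    """Return the first directory in search_dirs containing resource, else None.

    Simplified stand-in for NLTK's lookup: it only checks for an unpacked
    directory or file, not for zipped corpora.
    """
    for base in search_dirs:
        candidate = os.path.join(base, *resource.split("/"))
        if os.path.exists(candidate):
            return base
    return None


if __name__ == "__main__":
    try:
        import nltk
        # nltk.data.path is the list of directories NLTK searches for data.
        search_dirs = list(nltk.data.path)
    except ImportError:
        # NLTK is not installed here; nothing to search.
        search_dirs = []
    print(find_resource("corpora/brown", search_dirs))
```

If this prints `None` in a given environment, none of the searched directories contains an unpacked `corpora/brown`, which is consistent with the `LookupError` above.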

Question: Is there a way to load the NLTK corpora into the AWS Redshift Python environment so that my UDF works correctly?

1 Answer:

Answer 0: (score: 0)

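You can load custom libraries into the cluster; the official docs have more information.

I followed the instructions there, and they worked for me with a different library.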
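For what it's worth, here is a minimal sketch of the packaging step that comes before `CREATE LIBRARY`: zipping a pure-Python package so that the package folder sits at the top level of the archive. The helper name `zip_package` and the paths in the usage note are hypothetical, and the top-level layout is my assumption about what Redshift expects. Whether data files such as the NLTK corpora can be shipped this way is not something this answer establishes.

```python
import os
import zipfile


def zip_package(package_dir, zip_path):
    """Zip package_dir so the package folder is at the root of the archive.

    Write zip_path outside package_dir so the archive does not include itself.
    """
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Store paths relative to the parent, e.g. "mypkg/__init__.py".
                arcname = os.path.relpath(full_path, parent).replace(os.sep, "/")
                archive.write(full_path, arcname)
    return zip_path
```

For example, `zip_package("build/mypkg", "mypkg.zip")` would produce an archive that could then be uploaded to S3 and referenced in a `CREATE LIBRARY ... FROM 's3://...'` statement like the ones in the question.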