Question

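I'm running a Django application on AWS Elastic Beanstalk. I use an NLTK corpus package (stopwords), which I obtained with the NLTK downloader.

As a quick hack, I simply ran the NLTK downloader on the current (single) Elastic Beanstalk EC2 instance and saved the required corpus to /usr/local/share/nltk_data. This works for a single instance, but it will obviously be wiped when my load balancer decides to create a new instance (it does survive deployments).

My question: what is the best approach here, specifically for this kind of data?

Should I store it on S3 and hook that up to Elastic Beanstalk?

Or would it be easier (and better) to write a (Python?) script that EB is configured to call on each new instance, which downloads the data and puts it in a folder the application can access for the lifetime of the instance? That way, if I need to add other corpus downloads or do anything Python-specific or NLTK-specific, it happens in Python with no manual S3 work.

If anyone is in favor of writing a script for the EB configuration, an example would be great; I don't know exactly how to do this.

Thanks!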

Answer 1

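Using S3 for this particular use case is very easy (in combination with IAM and EC2 instance roles).

Even if the data changed quickly (I assume the NLTK corpora change slowly), you could manually sync the differences to the existing S3 location so that your instances can pick up the new data when they need it.

The key is to use Instance Profiles to give your instances an IAM role. With an appropriate policy, they can access S3 securely without you having to define your AWS credentials by hand, for example in scripts that need to call the AWS CLI at instance startup.

Using instance profiles for IAM permissions on AWS resources has a significant security advantage, because it removes hard-coded credentials from scripts, code in git, and so on.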
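As a rough illustration of what "an appropriate policy" could look like, the sketch below builds a read-only IAM policy document for a single bucket. The function name `build_read_only_policy`, the bucket name `mybucket`, and the `nltk_data/` prefix are assumptions for illustration; the resulting JSON would still have to be attached to the role behind your instance profile.

```python
import json


def build_read_only_policy(bucket: str, prefix: str = "") -> dict:
    """Return an IAM policy document that allows read-only access to one bucket.

    The bucket and prefix are placeholders; adapt them to your own setup.
    """
    bucket_arn = f"arn:aws:s3:::{bucket}"
    object_arn = f"{bucket_arn}/{prefix}*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing keys lets a sync compare remote and local files.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Downloading the objects themselves.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [object_arn],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_read_only_policy("mybucket", "nltk_data/"), indent=2))
```

Instances only need to read the corpus, so the sketch leaves out write actions such as `s3:PutObject`; the machine that uploads the data would need those separately.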

Then, assuming the AWS CLI is installed on Linux via apt, pip, etc.:

 # create the bucket (once). 
 # put in a region / az where your ec2 instances are 
 # to minimize data xfer

 # can run these from wherever to get your bucket / data up
 aws s3 mb s3://mybucket --region us-west-1

 # sync from wherever the first time & whenever needed
 # (upload under the nltk_data/ prefix so it matches the download below)
 aws s3 sync /usr/local/share/nltk_data s3://mybucket/nltk_data


 # can run the below on your instances
 #
 # put instance startup script after install of awscli etc.
 # or in myscript.sh file on your instance (even a gist)
 # wherever you want an instance to have your data or sync up

 aws s3 sync s3://mybucket/nltk_data /path/where/i/need

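The nice thing about the sync command is that it does not copy files that have not been modified, whether you are pushing up to S3 or pulling down. This makes it well suited to shared datasets, backups, and so on.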
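Since the question asked for a Python script that could run on each new instance, here is a minimal sketch that wraps the same `aws s3 sync` call. The helper names `build_sync_command` and `sync_nltk_data` are made up for illustration, and the bucket, prefix, and destination are the placeholder values used above. It assumes the AWS CLI is on the PATH and that the instance role allows reading the bucket.

```python
import shutil
import subprocess


def build_sync_command(bucket, prefix, dest):
    """Build the argument list for `aws s3 sync s3://<bucket>/<prefix> <dest>`."""
    return ["aws", "s3", "sync", f"s3://{bucket}/{prefix}", dest]


def sync_nltk_data(bucket, prefix, dest):
    """Pull the corpus from S3; raises if the CLI is missing or the sync fails."""
    if shutil.which("aws") is None:
        raise RuntimeError("AWS CLI not found on PATH")
    # check=True turns a non-zero exit status into CalledProcessError,
    # so a failed download is not silently ignored at instance startup.
    subprocess.run(build_sync_command(bucket, prefix, dest), check=True)


if __name__ == "__main__":
    # Placeholder values; /usr/local/share/nltk_data is the directory
    # mentioned in the question.
    sync_nltk_data("mybucket", "nltk_data", "/usr/local/share/nltk_data")
```

A script like this could be run from whatever startup hook you use, for example a command in an `.ebextensions` config file. If you sync to a directory NLTK does not search by default, you would also need to point NLTK at it, for instance through the `NLTK_DATA` environment variable or by appending to `nltk.data.path`.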

Answer 2

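I'll eventually test whether the other answer works better for more complex NLTK packages, but the stopwords are really just a list (or, I guess, several lists if you need multiple languages) that you can cut and paste into your script: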

>>> from nltk.corpus import stopwords
>>> stopwordlist = stopwords.words('english')
>>> print(stopwordlist)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']

So I just define it directly in my script without importing anything:

stopwordlist = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
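For completeness, a small sketch of how such a hard-coded list might be used. The `STOPWORDS` set here is a shortened subset chosen for illustration, and `remove_stopwords` is a hypothetical helper with naive whitespace tokenization, not part of NLTK.

```python
# Shortened subset for illustration; in practice paste the full list above.
STOPWORDS = frozenset(["i", "me", "my", "the", "and", "is", "a", "of", "to", "in"])


def remove_stopwords(text):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in STOPWORDS]


if __name__ == "__main__":
    print(remove_stopwords("The cat is in the hat"))  # prints ['cat', 'hat']
```

Storing the words in a `frozenset` rather than a list makes each membership check constant time on average, which matters if you filter a lot of text.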

Using downloaded NLTK data on AWS Elastic Beanstalk
