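I want to start a machine learning course, so I downloaded ud120-projects-master.zip and unzipped it into my Downloads folder. I installed Anaconda with Jupyter Notebook (Python 2.7).
The first mini-project is Naive Bayes, so I opened Jupyter Notebook and used %load nb_author_id.py to turn it into an .ipynb. But I think I first have to run startup.py in the tools folder to extract the data.
So I ran startup.ipynb.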
# %load startup.py
print
print "checking for nltk"
try:
    import nltk
except ImportError:
    print "you should install nltk before continuing"
print "checking for numpy"
try:
    import numpy
except ImportError:
    print "you should install numpy before continuing"
print "checking for scipy"
try:
    import scipy
except:
    print "you should install scipy before continuing"
print "checking for sklearn"
try:
    import sklearn
except:
    print "you should install sklearn before continuing"
print
print "downloading the Enron dataset (this may take a while)"
print "to check on progress, you can cd up one level, then execute <ls -lthr>"
print "Enron dataset should be last item on the list, along with its current size"
print "download will complete at about 423 MB"
import urllib
url = "https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz"
urllib.urlretrieve(url, filename="../enron_mail_20150507.tgz")
print "download complete!"
print
print "unzipping Enron dataset (this may take a while)"
import tarfile
import os
os.chdir("..")
tfile = tarfile.open("enron_mail_20150507.tgz", "r:gz")
tfile.extractall(".")
print "you're ready to go!"
But I got an error:
checking for nltk
checking for numpy
checking for scipy
checking for sklearn
downloading the Enron dataset (this may take a while)
to check on progress, you can cd up one level, then execute <ls -lthr>
Enron dataset should be last item on the list, along with its current size
download will complete at about 423 MB
---------------------------------------------------------------------------
IOError Traceback (most recent call last)
<ipython-input-1-c30fe1ced56a> in <module>()
32 import urllib
33 url = "https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz"
---> 34 urllib.urlretrieve(url, filename="../enron_mail_20150507.tgz")
35 print "download complete!"
36
And this is nb_author_id.py:
# %load nb_author_id.py
#!/usr/bin/python
"""
This is the code to accompany the Lesson 1 (Naive Bayes) mini-project.
Use a Naive Bayes Classifier to identify emails by their authors
authors and labels:
Sara has label 0
Chris has label 1
"""
import sys
from time import time
sys.path.append("../tools/")
from email_preprocess import preprocess
### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()
#########################################################
### your code goes here ###
#########################################################
Error/warning:
C:\Users\jr31964\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
no. of Chris training emails: 7936
no. of Sara training emails: 7884
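How do I get started with the Naive Bayes mini-project, and what prerequisites do I need?
Answer 0 (score: 1)
Since I assume the course is in Python 3, I would suggest creating a conda environment with Python 3. You can do this even if your base installation is Python 2. That should save you from converting all of the course code from Python 3 to your Python 2.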
conda create --name UdacityCourseEnvironment python=3.6
# to get into your new environment (mac/linux)
source activate UdacityCourseEnvironment
# to get into your new environment (windows)
activate UdacityCourseEnvironment
# When you need new packages inside your new environment
conda install nameOfPackage
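Once the new environment is active, a quick check from Python can confirm which interpreter is actually running and whether the packages that startup.py looks for are importable. This is only a minimal sketch, assuming Python 3.4 or later; the helper name missing_packages is made up for illustration and is not part of the course code.

```python
import importlib.util
import sys

# The four packages that the course's startup.py checks for.
REQUIRED = ["nltk", "numpy", "scipy", "sklearn"]


def missing_packages(names):
    """Return the top-level package names the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Show which interpreter is in use, so you can tell whether the
# conda environment was really activated.
print("Python", sys.version.split()[0], "at", sys.executable)

for name in missing_packages(REQUIRED):
    print("you should install", name, "before continuing")
```

If the printed path does not point inside the environment you created, the activate step probably did not take effect in that shell.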
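Answer 1 (score: 0)
You made the right decision to work with Anaconda: it takes care of a whole range of incompatibilities between Python 2 and Python 3, as well as various package dependencies. I did the same, and I am converting the code (and its dependencies) to Python 3 because I want an up-to-date environment and up-to-date programming skills by the time I finish; but that's just me.
Evidently you can ignore that deprecation warning: the code still works with sklearn 0.19.0. Anyone trying to run it on 0.20.0 or later will run into problems, since the warning says the module is removed in 0.20. However, if you find the warning annoying (as I do), you can edit the file tools/email_preprocess.py and change the following lines (the original is kept in the comment):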
# from sklearn import cross_validation
from sklearn.model_selection import train_test_split
and
#features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(word_data, authors, test_size=0.1, random_state=42)
features_train, features_test, labels_train, labels_test = train_test_split(word_data, authors, test_size=0.1, random_state=42)
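Also, some installs depend on others. An install that succeeded earlier (e.g. numpy) turned out to make the install of another package fail (e.g. scipy), because that package's prerequisite was numpy+mkl. If you have just installed plain numpy, you need to uninstall it and replace it. See more at https://github.com/scipy/scipy/issues/7221
The next problem I hit was that, on my machine, the number of email files in enron_mail_20150507.tgz is so large that the script ran for hours without ever reaching the completion message: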
print "you're ready to go!"
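It turned out that my IDE (PyCharm) was indexing the files as they were being unpacked, and that was what was killing the disk. Since indexing the text files is unnecessary, I turned indexing off for the 'maildir' directory. That allowed startup.py to finish.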
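If you want to tell a slow extraction apart from a hung one, one option is to extract the archive entry by entry and print a running count. This is a sketch rather than what startup.py does (the script calls extractall once); the function name extract_with_progress and the report_every parameter are made up for illustration, and the code is written for Python 3.

```python
import tarfile


def extract_with_progress(archive_path, dest=".", report_every=10000):
    """Extract a gzip-compressed tar archive and print a running entry count.

    Returns the total number of entries (files and directories) extracted.
    """
    count = 0
    with tarfile.open(archive_path, "r:gz") as tfile:
        # Iterating over the TarFile yields one TarInfo per archive entry.
        for member in tfile:
            tfile.extract(member, dest)
            count += 1
            if count % report_every == 0:
                print("extracted", count, "entries so far")
    print("done:", count, "entries extracted")
    return count
```

For the course data this would be called as extract_with_progress("enron_mail_20150507.tgz") from the directory that holds the archive. If the count keeps climbing, the script is still working and the disk is simply the bottleneck.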
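The urllib error you hit is related to a change in the package: in Python 3, urlretrieve moved to urllib.request, so under Python 3 you need to change the import statement to: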
import urllib.request
...and then your line 34 (from the error message above) becomes:
urllib.request.urlretrieve(url, filename="../enron_mail_20150507.tgz")
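If you want one script that runs under both Python 2 and Python 3, the import can be wrapped in a try/except. The following is a minimal sketch, not the course's own code: the function name download_enron is made up for illustration, while the URL and target filename are the ones from startup.py in the question.

```python
try:
    from urllib.request import urlretrieve  # Python 3 location
except ImportError:
    from urllib import urlretrieve  # Python 2 location

# URL and default filename as used by startup.py in the question.
ENRON_URL = "https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz"


def download_enron(url=ENRON_URL, filename="../enron_mail_20150507.tgz"):
    """Download url to filename and return the local path that was written."""
    path, _headers = urlretrieve(url, filename=filename)
    return path
```

Calling download_enron() with no arguments starts the full download of roughly 423 MB, so the call is deliberately left out of the snippet itself.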
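Also note that this page on GitHub is very useful: https://github.com/MLTO/general/wiki/Python-Setup-for-Udacity-ud120-course
The rest of this answer relates to Windows 10, so Linux users can skip it.
The next problem I hit was that some package imports failed because the installs were not properly optimized for Windows 10. An invaluable resource for fixing this is the set of Windows-optimized .whl (wheel) files available at http://www.lfd.uci.edu/~gohlke/pythonlibs/
The next problem was that unpacking the .tgz file introduced the LF/CRLF line-ending mismatch between Linux and Windows files, which may be familiar. There is a fix from @monkshow92 on GitHub here: https://github.com/udacity/ud120-projects/issues/46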
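The linked fix is not reproduced here. As a generic sketch of what normalizing line endings looks like in Python, the function below copies a file while replacing every CRLF with LF. The name crlf_to_lf and both path parameters are made up for illustration, and this is only appropriate for text-format files, since a blind byte replacement would corrupt genuinely binary data.

```python
def crlf_to_lf(src_path, dst_path):
    """Copy src_path to dst_path, replacing each CRLF pair with a single LF.

    Returns the number of CRLF pairs found in the source file.
    """
    # Read and write in binary mode so Python does not translate
    # line endings on its own.
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(data.replace(b"\r\n", b"\n"))
    return data.count(b"\r\n")
```

Writing to a separate destination path keeps the original file intact in case the conversion turns out not to be what the file needed.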
Apart from that, it was a breeze.