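Yesterday I tried to complete Lesson 11 of the Udacity course, which covers text vectorization. I stepped through the code and everything seemed to work: I take some emails, open them, remove a few signature words, and append the stemmed words of each email to a list.
Here is loop 1: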
for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    for path in from_person:
        ### only look at first 200 emails when developing
        ### once everything is working, remove this line to run over full dataset
        # temp_counter += 1
        if temp_counter < 200:
            path = os.path.join('/xxx', path[:-1])
            email = open(path, "r")
            ### use parseOutText to extract the text from the opened email
            email_stemmed = parseOutText(email)
            ### use str.replace() to remove any instances of the words
            ### ["sara", "shackleton", "chris", "germani"]
            email_stemmed.replace("sara","")
            email_stemmed.replace("shackleton","")
            email_stemmed.replace("chris","")
            email_stemmed.replace("germani","")
            ### append the text to word_data
            word_data.append(email_stemmed.replace('\n', ' ').strip())
            ### append a 0 to from_data if email is from Sara, and 1 if email is from Chris
            if from_person == "sara":
                from_data.append(0)
            elif from_person == "chris":
                from_data.append(1)
            email.close()
Here is loop 2:
for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    for path in from_person:
        ### only look at first 200 emails when developing
        ### once everything is working, remove this line to run over full dataset
        # temp_counter += 1
        if temp_counter < 200:
            path = os.path.join('/xxx', path[:-1])
            email = open(path, "r")
            ### use parseOutText to extract the text from the opened email
            stemmed_email = parseOutText(email)
            ### use str.replace() to remove any instances of the words
            ### ["sara", "shackleton", "chris", "germani"]
            signature_words = ["sara", "shackleton", "chris", "germani"]
            for each_word in signature_words:
                stemmed_email = stemmed_email.replace(each_word, '') #careful here, dont use another variable, I did and broke my head to solve it
            ### append the text to word_data
            word_data.append(stemmed_email)
            ### append a 0 to from_data if email is from Sara, and 1 if email is from Chris
            if name == "sara":
                from_data.append(0)
            else: # its chris
                from_data.append(1)
            email.close()
The next part of the code works as expected:
print("emails processed")
from_sara.close()
from_chris.close()
pickle.dump( word_data, open("/xxx/your_word_data.pkl", "wb") )
pickle.dump( from_data, open("xxx/your_email_authors.pkl", "wb") )
print("Answer to Lesson 11 quiz 19: ")
print(word_data[152])
### in Part 4, do TfIdf vectorization here
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import stop_words
print("SKLearn has this many Stop Words: ")
print(len(stop_words.ENGLISH_STOP_WORDS))
vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
vectorizer.fit_transform(word_data)
feature_names = vectorizer.get_feature_names()
print('Number of different words: ')
print(len(feature_names))
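But when I count the total number of words using loop 1, I get the wrong result. When I do the same with loop 2, I get the correct result.
I have been staring at this code for too long and cannot spot the difference. What am I doing wrong in loop 1?
For the record, the wrong answer I keep getting is 38825. The correct answer should be 38757.
Thank you very much for your help, kind stranger!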
Answer 0 (score: 3)
These lines do nothing:
email_stemmed.replace("sara","")
email_stemmed.replace("shackleton","")
email_stemmed.replace("chris","")
email_stemmed.replace("germani","")
replace returns a new string and does not modify email_stemmed, because Python strings are immutable. Instead, assign the return value back to email_stemmed:
email_stemmed = email_stemmed.replace("sara", "")
and so on.
The second loop does assign the return value inside the for loop:
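As a minimal sketch of this behavior (the sample string and the variable names `text`, `unchanged`, and `cleaned` are made up for illustration and are not from the lesson code), compare a discarded `replace` call with an assigned one:

```python
# Python strings are immutable: str.replace returns a new string
# and leaves the original object untouched.
text = "sara shackleton says hi"  # hypothetical sample text

text.replace("sara", "")   # return value is discarded, so text does not change
unchanged = text

cleaned = text.replace("sara", "")  # return value is kept in a new name

print(repr(unchanged))
print(repr(cleaned))
```

Only the second form has any effect on what is later appended to the list.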
for each_word in signature_words:
    stemmed_email = stemmed_email.replace(each_word, '')
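The snippets above are not equivalent: at the end of the first snippet, email_stemmed is completely unchanged because the return value of replace is never used, whereas at the end of the second snippet, stemmed_email has actually had each word removed.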
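To see why this changes the word count, here is a rough sketch under stated assumptions: the two sample emails and the helper `strip_signature_words` are hypothetical, and a plain whitespace split stands in for the vectorizer's tokenization, so the numbers are illustrative only and will not match the lesson's figures.

```python
SIGNATURE_WORDS = ["sara", "shackleton", "chris", "germani"]


def strip_signature_words(text, words=SIGNATURE_WORDS):
    """Hypothetical helper: return text with each signature word removed."""
    for word in words:
        # Assign the result back; replace() never edits text in place.
        text = text.replace(word, "")
    return text


# Made-up stand-ins for stemmed email bodies.
emails = [
    "meet tomorrow sara shackleton",
    "forward contract chris germani",
]

# Loop 1 behavior: replace() results are discarded, so the raw text is stored.
vocab_loop1 = {token for email in emails for token in email.split()}

# Loop 2 behavior: each replace() result is assigned back before storing.
vocab_loop2 = {
    token
    for email in emails
    for token in strip_signature_words(email).split()
}

print(len(vocab_loop1), len(vocab_loop2))
print(sorted(vocab_loop1 - vocab_loop2))
```

In this toy data the extra vocabulary entries in the first set are exactly the signature words that were never removed, which is the same kind of surplus that inflates the feature count in loop 1.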