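I'm very new to Python. I have found answers to this error, but I don't have enough experience to see exactly where I'm going wrong; it's probably something very basic.
I'm working on a project to identify authors based on the words they use in their texts. I add each author's words to a dictionary, with the word as the key and the number of times the word appears in that author's texts as the value. I also build a vocabulary of all words across all authors and use both to calculate probabilities. This initially worked fine.
My problem arose when I added k-fold cross-validation, which I need because my corpus is not particularly large. I iterate through a list of author names, which match the names I gave the empty dictionaries. Once I have pulled out the files I want, I try to add the cleaned/parsed text to the dictionary, but I get the above error. It points to the line author[word]= 1 in my dictionary function, which I call in the second-to-last line of code below. From my reading of other answers, it has to do with str being immutable, but I just can't see how to apply those answers to my problem. Any help is much appreciated!
P.S. I know there are libraries that do all of this, but the whole idea of the project is to build my own model and compare it with others.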
path = "C:\\......\The Letters\\"
#create an empty vocab set
vocab = set()
stop = stopwords.words('english')
snowball = SnowballStemmer('english')
#create empty dictionary for each author
AuthorA = {}
AuthorB = {}
AuthorC = {}
authorList = ["AuthorA","AuthorB","Authorc"]
#function to preprocess the words. Opens & reads file, removes non alphabet
#characters, converts to lowercase, and tokenizes
def cleanText(path,author,eachfile):
    f= open(path+author+"\\"+eachfile, "r")
    contents = f.read()
    strip = re.sub('[^a-zA-Z]',' ',contents)
    lowerCase = strip.lower()
    allwords = lowerCase.split()
    return allwords
#function to add words to the vocabulary set
def createVocab(allwords):
    for word in allwords:
        if len(word)>= 4:
            vocab.update(allwords)
    return
#function to add words to author dictionary and count occurrences of each word
def dictionary(allwords, author):
    for word in allwords:
        if len(word)>= 4:
            if word in author:
                author[word]= author[word]+1
            else:
                author[word]= 1
    return
def main():
    global authorList
    global path
    global vocab
    global AuthorA
    global AuthorB
    global AuthorC
    for author in authorList:
        #filename and path
        listing = os.listdir(path+author)
        #specify parameters for k fold validation
        #split into 10 folds and take a file from each fold
        #repeat until the entire directory has been split
        folds = 10
        subset_size = len(path+author)/folds
        for i in range(folds):
            #use these files to train the model
            current_train = listing[:i*subset_size:]+listing[(i+1)*subset_size:]
            #use these files to test the model
            current_test = listing[i*subset_size:][:subset_size]
            #iterate through the files selected by current_train variable
            for eachfile in current_train:
                #call function to parse text
                allwords = cleanText(path,author,eachfile)
                #call fn to add words to dictionary
                dictionary(allwords, author)
                #call fn to add words to vocab
                createVocab(allwords)
Answer 0 (score: 1)
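You are passing a string to your dictionary function: the value of the variable author. The top for loop, for author in authorList:
iterates over a list of strings, not over a collection of dictionaries. authorList = ["AuthorA","AuthorB","Authorc"]
Inside the function, author[word]= 1 therefore tries to assign into a str, which is immutable. You want to pass the dicts themselves to your function instead. Hope that helps!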
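A minimal sketch of one way to do that, assuming you keep the author names as strings (you still need them for the folder paths) and look up each author's dict by name. The names author_names, author_counts and count_words are illustrative and not from your code, and the hard-coded word list only stands in for the output of cleanText:

```python
# Author names double as folder names, so they stay strings.
author_names = ["AuthorA", "AuthorB", "AuthorC"]

# One word-count dict per author, keyed by name (hypothetical structure).
author_counts = {name: {} for name in author_names}


def count_words(allwords, counts):
    """Count words of 4+ letters into counts, like dictionary() in the question."""
    for word in allwords:
        if len(word) >= 4:
            counts[word] = counts.get(word, 0) + 1


# What the question's code effectively does: item assignment on a str.
name = "AuthorA"
try:
    name["dear"] = 1
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)

for name in author_names:
    # Stand-in for cleanText(path, name, eachfile).
    allwords = ["dear", "friend", "dear", "sir"]
    # Pass the dict that belongs to this author, not the name itself.
    count_words(allwords, author_counts[name])

print(author_counts["AuthorA"])
```

Keeping the dicts in one outer dict also removes the need for the separate AuthorA, AuthorB and AuthorC globals, since each name maps straight to its own counts.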