Question

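I have a .txt file with multiple lines, each containing a sentence. Let's say the file is called sentences.txt. I also have a dictionary, which I'll call sentiment_scores, that holds predefined sentiment values for about 2500 words. My goal is to return a dictionary that predicts the sentiment value of words that are not in sentiment_scores. I do this by taking the average score of every sentence the word appears in.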

with open('sentences.txt', 'r') as f:
    sentences = [line.strip() for line in f]  # the with block closes the file on exit

new_term_sent = {}
for line in sentences:
    for word in line.split():  # iterate through the words in the sentence
        if word not in sentiment_scores:
            new_term_sent[word] = 0  # assign the word a sentiment value of 0 initially

for key in new_term_sent:
    score = 0
    num_sentences = 0
    for sentence in sentences:
        if key in sentence.split():
            num_sentences += 1
            val = get_sentiment(sentence)  # this function returns the sentiment of a sentence
            score += val
    if num_sentences != 0:
        average = round(score / num_sentences, 1)
        new_term_sent[key] = average

return new_term_sent

Please note: this approach works, but it is too slow. It takes about 80 seconds to run on my laptop.

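So my question is: how can I do this more efficiently? I tried just using .readlines() on sentences.txt, but that didn't work (I couldn't figure out why, but I know it has something to do with iterating through the text file multiple times; maybe the file pointer gets lost somehow). Thanks in advance!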

Answer 1

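Apart from using concurrency, which can get quite complicated, you can optimize the loops. If all the words in a sentence are unique and a sentence has M words on average, your current code calls get_sentiment up to M times on the same sentence.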

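Instead of putting every individual word into new_term_sent and initializing its value to zero, map each word to an empty list. You can then compute the sentiment once per sentence and append that value to the list of every word that occurs in that sentence.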

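from collections import defaultdict

# Map each unseen word to the scores of the sentences it appears in.
word_to_scores = defaultdict(list)
for sentence in sentences:
    sentence_sentiment = get_sentiment(sentence)  # computed once per sentence
    for word in set(sentence.split()):  # set(): count a sentence once per word
        if word not in sentiment_scores:
            word_to_scores[word].append(sentence_sentiment)

new_term_sent = {}
for word, sentence_sentiments in word_to_scores.items():
    new_term_sent[word] = round(sum(sentence_sentiments) / len(sentence_sentiments), 1)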
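
To sanity-check the idea, here is a small self-contained sketch that runs both versions on toy data and counts the calls to the scoring function. The get_sentiment stub, the toy sentiment_scores dictionary, and the sample sentences are all made up for illustration; your real scoring function will differ.

```python
from collections import defaultdict

# Toy stand-ins (assumptions, not the asker's real data).
sentiment_scores = {"good": 1.0, "bad": -1.0}
sentences = ["good movie tonight", "bad movie", "good food tonight"]

calls = 0

def get_sentiment(sentence):
    """Hypothetical stub: sum of known word scores; also counts calls."""
    global calls
    calls += 1
    return sum(sentiment_scores.get(w, 0.0) for w in sentence.split())

def original(sentences):
    """The nested-loop approach from the question."""
    new_term_sent = {}
    for line in sentences:
        for word in line.split():
            if word not in sentiment_scores:
                new_term_sent[word] = 0
    for key in new_term_sent:
        score, num_sentences = 0, 0
        for sentence in sentences:
            if key in sentence.split():
                num_sentences += 1
                score += get_sentiment(sentence)
        new_term_sent[key] = round(score / num_sentences, 1)
    return new_term_sent

def single_pass(sentences):
    """Score each sentence once, then average per word."""
    word_to_scores = defaultdict(list)
    for sentence in sentences:
        s = get_sentiment(sentence)
        for word in set(sentence.split()):
            if word not in sentiment_scores:
                word_to_scores[word].append(s)
    return {w: round(sum(v) / len(v), 1) for w, v in word_to_scores.items()}

calls = 0
slow = original(sentences)
slow_calls = calls

calls = 0
fast = single_pass(sentences)
fast_calls = calls

print(slow == fast, slow_calls, fast_calls)  # True 5 3
```

On this toy input both versions return the same dictionary, but the single-pass version scores each sentence only once. The gap grows with the number of unseen words per sentence.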

P.S. Like the original code, this assumes that each line is a separate sentence. I'm not sure whether that assumption holds for you.
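
If a line can hold more than one sentence, you could split the lines before scoring. This is a rough sketch, assuming sentences end in ".", "!" or "?" followed by whitespace. The regex is a simplification of my own and will mis-split abbreviations such as "e.g. this", and the helper name split_sentences is made up.

```python
import re

def split_sentences(lines):
    """Split each line on ., ! or ? followed by whitespace (naive heuristic)."""
    sentences = []
    for line in lines:
        for part in re.split(r"(?<=[.!?])\s+", line.strip()):
            if part:  # skip empty pieces from blank lines
                sentences.append(part)
    return sentences

lines = ["I loved it. The ending was weak!", "Would watch again"]
print(split_sentences(lines))
# ['I loved it.', 'The ending was weak!', 'Would watch again']
```

Note that punctuation stays attached to the words ("it."), just as it does with the plain split() in the original code.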

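P.P.S. I think the check in the following snippet is unnecessary. The loop only iterates over the keys of the dictionary, and every key in the dictionary was taken from some sentence earlier, so num_sentences is always >= 1.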

if num_sentences != 0:
    average = round((score)/(num_sentences),1)
    new_term_sent[key] = average
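
On the .readlines() problem from the question: I can't see that code, so this is a guess, but a common cause is that a file object is an iterator that is used up after one full pass. A second loop over the same open file then sees nothing. A small sketch of that behaviour, using a temporary file in place of sentences.txt:

```python
import os
import tempfile

# Write a throwaway file standing in for sentences.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first sentence\nsecond sentence\n")
    path = tmp.name

with open(path, "r") as f:
    first_pass = [line.strip() for line in f]
    second_pass = [line.strip() for line in f]  # file position is already at EOF
    f.seek(0)                                   # rewind to the start of the file
    after_seek = [line.strip() for line in f]

os.remove(path)

print(first_pass)   # ['first sentence', 'second sentence']
print(second_pass)  # []
print(after_seek)   # ['first sentence', 'second sentence']
```

Reading the lines into a list once, as your code already does, avoids the issue, because a list can be iterated as many times as you like.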

Iterating over a text file multiple times in Python
