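I am counting how many times certain words (those in a "dictionary") appear in a given file.
While my code below works fine, it is painful to look at and almost certainly an embarrassment to the Zen of Python.
I would greatly appreciate any tips on how to make the "Goliath loop" cleaner and more efficient.
Each score must have its own unique counter, and each dictionary must have its own unique name. That led me to rule out looping over some kind of range.

Full context

I have about 140,000 text files and 9 "dictionaries", each containing a different number of words. For each file, I clean the text and then count how many words in that text file match a word in each of the 9 dictionaries.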
import itertools

for file in all_files:
    # Extract firm and year identifiers from file names
    cik_identifier = file[70:-4].split('_')[0]
    financial_year = file[70:-4].split('_')[1]
    filing_year = file[70:-4].split('_')[2]
    filing_type = '10K'
    # Conduct final cleaning of text file
    with open(file) as my_file:
        text = my_file.read()
    words = text.split()
    lower_case_words = [word.lower() for word in words]
    alphabetic_only = [word for word in lower_case_words if word.isalpha()]
    cleaned_words = \
        [word for word in alphabetic_only if word not in stop_words]
    # Log length of text doc pre and post clean
    num_words_pre_clean = len(lower_case_words)
    num_words_post_clean = len(cleaned_words)
    # Calculate Sentiment Scores
    first_sentiment_score = 0
    second_sentiment_score = 0
    third_sentiment_score = 0
    fourth_sentiment_score = 0
    fifth_sentiment_score = 0
    sixth_sentiment_score = 0
    seventh_sentiment_score = 0
    eighth_sentiment_score = 0
    ninth_sentiment_score = 0
    # Goliath loop begins
    for word in cleaned_words:
        for first_sentiment_word, second_sentiment_word, third_sentiment_word, \
            fourth_sentiment_word, fifth_sentiment_word, sixth_sentiment_word, \
            seventh_sentiment_word, eighth_sentiment_word, ninth_sentiment_word \
                in itertools.zip_longest(dict_first, dict_second,
                                         dict_third, dict_fourth,
                                         dict_fifth, dict_sixth,
                                         dict_seventh, dict_eighth, dict_ninth):
            if first_sentiment_word == word:
                first_sentiment_score += 1
            elif second_sentiment_word == word:
                second_sentiment_score += 1
            elif third_sentiment_word == word:
                third_sentiment_score += 1
            elif fourth_sentiment_word == word:
                fourth_sentiment_score += 1
            elif fifth_sentiment_word == word:
                fifth_sentiment_score += 1
            elif sixth_sentiment_word == word:
                sixth_sentiment_score += 1
            elif seventh_sentiment_word == word:
                seventh_sentiment_score += 1
            elif eighth_sentiment_word == word:
                eighth_sentiment_score += 1
            elif ninth_sentiment_word == word:
                ninth_sentiment_score += 1
    # Append identifier, num words, and trust score to df
    sentiment_analysis_data = {'cik': cik_identifier,
                               'financial_year_end': financial_year,
                               'filing_year_end': filing_year,
                               'filing_type': filing_type,
                               'num_words_pre_clean': num_words_pre_clean,
                               'num_words_post_clean': num_words_post_clean,
                               'first_sentiment_score': first_sentiment_score,
                               'second_sentiment_score': second_sentiment_score,
                               'third_sentiment_score': third_sentiment_score,
                               'fourth_sentiment_score': fourth_sentiment_score,
                               'fifth_sentiment_score': fifth_sentiment_score,
                               'sixth_sentiment_score': sixth_sentiment_score,
                               'seventh_sentiment_score': seventh_sentiment_score,
                               'eighth_sentiment_score': eighth_sentiment_score,
                               'ninth_sentiment_score': ninth_sentiment_score}
    all_scores.append(sentiment_analysis_data)
Answer 0 (score: 0)
A list of counters is still a set of unique counters.
sentiment_scores = [0] * 9
A list of dictionaries is still a set of unique dictionaries.
dicts = [dict_one, dict_two, ...] # etc
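To illustrate why lists do not cost you the unique counters or the unique names, here is a minimal sketch. The labels in `names` and the tiny word lists are hypothetical stand-ins for the real dictionaries, not data from the question; the only assumption is that position `i` in each list refers to the same dictionary.

```python
# Hypothetical labels and tiny word lists, for illustration only.
names = ['first', 'second', 'third']
dicts = [['gain', 'profit'], ['loss', 'risk', 'decline'], ['may']]

# One counter per dictionary; index i in every list refers to the same dictionary.
sentiment_scores = [0] * len(dicts)

# Incrementing one slot leaves the other counters untouched.
sentiment_scores[1] += 1

# The original names can be recovered whenever they are needed.
labelled = {f'{name}_sentiment_score': score
            for name, score in zip(names, sentiment_scores)}
print(labelled)
```

Because the integers in `[0] * n` are immutable, each slot behaves as an independent counter.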
Now you can write the loop in a way that is less likely to blind you.
# Goliath loop begins
for word in cleaned_words:
    for sentiment_words in itertools.zip_longest(*dicts):
        for i, sentiment_word in enumerate(sentiment_words):
            if sentiment_word == word:
                sentiment_scores[i] += 1
# Append identifier, num words, and trust score to df
sentiment_analysis_data = {'cik': cik_identifier,
                           'financial_year_end': financial_year,
                           'filing_year_end': filing_year,
                           'filing_type': filing_type,
                           'num_words_pre_clean': num_words_pre_clean,
                           'num_words_post_clean': num_words_post_clean,
                           'first_sentiment_score': sentiment_scores[0],
                           'second_sentiment_score': sentiment_scores[1],
                           'third_sentiment_score': sentiment_scores[2],
                           'fourth_sentiment_score': sentiment_scores[3],
                           'fifth_sentiment_score': sentiment_scores[4],
                           'sixth_sentiment_score': sentiment_scores[5],
                           'seventh_sentiment_score': sentiment_scores[6],
                           'eighth_sentiment_score': sentiment_scores[7],
                           'ninth_sentiment_score': sentiment_scores[8]}
Ideally, sentiment_analysis_data could use a single key, 'sentiment_scores', mapped to the list of scores, but it is not clear where (if anywhere) you would be able to make that change.
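As a rough sketch of that single-key layout, the snippet below stores all counts under 'sentiment_scores'. The helper `count_matches`, the sample words, and the placeholder `cik` value are hypothetical and not part of the original code. It also swaps the `zip_longest` scan for set membership, which should give the same counts only under the assumption that no dictionary lists the same word twice.

```python
def count_matches(cleaned_words, dicts):
    """Return one count per dictionary: how many words of the text appear in it."""
    # Sets make each membership test cheap compared with scanning a list.
    lexicon_sets = [set(d) for d in dicts]
    return [sum(1 for word in cleaned_words if word in lexicon)
            for lexicon in lexicon_sets]


# Hypothetical sample data, for illustration only.
cleaned_words = ['profit', 'risk', 'may', 'risk', 'growth']
dicts = [['gain', 'profit'], ['loss', 'risk', 'decline'], ['may']]

sentiment_analysis_data = {
    'cik': '0000000000',  # placeholder identifier, not a real value
    'sentiment_scores': count_matches(cleaned_words, dicts),
}
print(sentiment_analysis_data['sentiment_scores'])  # [1, 2, 1]
```

Whether this layout is usable depends on how all_scores is consumed later; if one column per score is required downstream, the list can be expanded back into named keys at that point.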