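I have a folder of text files, and I want to write a script that splits each file into three-word sets (trigrams), counts those sets, and writes them to a CSV file with rows of the form "date, trigram, frequency, relative frequency". Each text file has a title like the following: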
2014.5.30RNC Chairman Priebus Statement.txt
2012.8.17Homeless Veterans Need More From Obama.txt
2012.9.6GLARING OMISSION #16/ Shinseki Glosses.txt
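To make the naming scheme and the target rows concrete, here is a minimal sketch in plain Python (no NLTK). The helper names `parse_title` and `trigram_rows` are hypothetical, whitespace tokenization is a simplification, and I am assuming "relative frequency" means the count divided by the total number of trigrams. It also assumes the text after the day does not itself start with a digit.

```python
import re
from collections import Counter

def parse_title(title):
    """Split a title like '2014.5.30RNC ... .txt' into (year, month, day, rest).

    Hypothetical helper: assumes the title starts with 'YYYY.M.D'.
    """
    m = re.match(r'^(\d{4})\.(\d{1,2})\.(\d{1,2})(.*)$', title)
    if m is None:
        return None
    year, month, day, rest = m.groups()
    return int(year), int(month), int(day), rest

def trigram_rows(date, text, minscore=1):
    """Return (date, trigram, frequency, relative frequency) tuples."""
    tokens = text.lower().split()  # simplification: split on whitespace
    trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
    counts = Counter(trigrams)
    total = len(trigrams)
    rows = []
    for tg, n in counts.items():
        if n > minscore:  # keep only trigrams seen more than minscore times
            rows.append((date, " ".join(tg), n, n / float(total)))
    return rows

parsed = parse_title("2014.5.30RNC Chairman Priebus Statement.txt")
rows = trigram_rows("2014.5", "a b c a b c a b c")
```

Each tuple in `rows` corresponds to one line of the CSV file I am trying to produce.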
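I wrote the script below. It does nothing, but it also doesn't produce any error message. I think that means something is wrong with my regular expression or my nested loops, but I don't know how to troubleshoot this without an error message. Thanks in advance for your help!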
import csv
import string
from decimal import Decimal

import nltk
from nltk.corpus import PlaintextCorpusReader

corpus_root = '/Users/jolijttamanaha/Desktop/thesis2/RNC/Data2'
for year in range(2015, 1990, 1):
    for month in range(12, 9, 1):
        speeches = PlaintextCorpusReader(corpus_root, r'^{}\.{}\.\d*[\s\S]*'.format(year,month))
        raw = speeches.raw().lower()
        tokens = nltk.word_tokenize(raw.encode('utf-8').translate(None, string.punctuation))
        tgs = nltk.trigrams(tokens)
        fdist = nltk.FreqDist(tgs)
        minscore = 1
        numwords = len(raw)
        print "Words in corpus:"
        print numwords
        c = csv.writer(open("RNCngramsbymonth.csv", "a"))
        for k,v in fdist.items():
            if v > minscore:
                rf = Decimal(v)/Decimal(numwords)
                firstword, secondword, thirdword = k #splits up the list hidden in k
                trigram = firstword + " " + secondword + " " + thirdword #turns the list in k into one string
                time = year + month
                results = time,trigram,v,rf
                c.writerow(results)
                print firstword, secondword, thirdword, v, rf
        print "Done with month {}".format(month)
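One way to troubleshoot a script that is silent is to test its pieces in isolation. The sketch below uses only built-ins and the standard `re` module to look at what the two `range()` calls and the filename pattern actually produce. The variable names are just for illustration, and the descending ranges are an assumption about what was intended, not a confirmed fix.

```python
import re

# With a positive step and start > stop, range() is empty,
# so a loop over it never runs its body and raises no error.
years = list(range(2015, 1990, 1))
months = list(range(12, 9, 1))
print(years, months)  # both lists are empty

# Counting down needs a negative step; the stop value is excluded.
years_desc = list(range(2015, 1990, -1))  # 2015 down to 1991
months_desc = list(range(12, 0, -1))      # 12 down to 1

# The filename pattern can be checked on its own against a sample title.
pattern = r'^{}\.{}\.\d*[\s\S]*'.format(2014, 5)
title = "2014.5.30RNC Chairman Priebus Statement.txt"
matched = re.match(pattern, title) is not None
print(matched)
```

If the two lists are empty while the pattern matches, the loops rather than the regular expression would explain why nothing is printed.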