So I'm working on a text mining project. I'm trying to open all the files, pull out the organization and abstract information, split the abstracts into words, and then figure out how many files each word appears in. My question is about that last step: how many files does a given word show up in? To answer it I'm building a dictionary, wordFrequency, to keep count. What I want to tell the dictionary is: if a word is not in the dictionary yet, record the word together with the file number it came from; if the word is already in the dictionary but this file number differs from every file number already recorded for it, append the file number; if both the word and its file number are already in the dictionary, ignore it. Here is my code.
import re

capturedfiles = []
capturedabstracts = []
wordFrequency = {}

wordlist = open('test.txt', 'w')
worddict = open('test3.txt', 'w')

# 'matches' is the list of file paths collected earlier
for filepath in matches[0:5]:
    with open(filepath, 'rt') as mytext:
        mytext = mytext.read()
        #print mytext

        # code to capture the file number
        grabFile = re.findall(r'File\s+\:\s+(\w\d{7})', mytext)
        if len(grabFile) == 0:
            matchFile = "N/A"
        else:
            matchFile = grabFile[0]
        capturedfiles.append(matchFile)

        # code to capture the file abstract
        grabAbs = re.findall(r'Abstract\s\:\s\d{7}\s(\w.+)', mytext)
        if len(grabAbs) == 0:
            matchAbs = "N/A"
        else:
            matchAbs = grabAbs
        capturedabstracts.append(matchAbs)

        # arrange words in format
        lineCount = 0
        wordCount = 0
        lines = matchAbs[0].split('. ')
        for line in lines:
            lineCount += 1
            for word in line.split(' '):
                wordCount += 1
                wordlist.write(matchFile + '|' + str(lineCount) + '|' + str(wordCount) + '|' + word + '\n')
                if word not in wordFrequency:
                    wordFrequency[word] = [matchFile]
                else:
                    if matchFile not in wordFrequency[word]:
                        wordFrequency[word].append(matchFile)
                worddict.write(word + '|' + str(matchFile) + '\n')

wordlist.close()
worddict.close()
What I'm getting right now is every word printed out with the file number it matched, so if a word appears twice in the whole text it gets printed twice, as separate lines. Here is a sample:
mutation| a9500006
is| a9500006
is| a9500007
What I would like it to look like is:
mutation| a9500006
is| a9500006,a9500007
Answer (score 0):
Instead of writing to worddict every time inside the loop, write out the whole wordFrequency dictionary after it has been built, like this:
#assuming wordFrequency is a correctly built dictionary
for key, value in wordFrequency.items():
    #key is a word, value is its list of file numbers
    worddict.write(key + '|')
    for fileno in value:
        #write each file number in the list
        worddict.write(fileno)
        #if it's not the last one, write a comma
        if fileno != value[-1]:
            worddict.write(', ')
    #no more file numbers, end the line
    worddict.write('\n')
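A shorter way to get the same comma-separated line is str.join. This is only a sketch layered on the same wordFrequency dictionary and worddict file handle as above, not a change to the logic:

for key, value in wordFrequency.items():
    # join the file numbers for this word into one comma-separated string
    worddict.write(key + '|' + ', '.join(value) + '\n')

It avoids the last-element check entirely, since join only puts separators between items.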
PS: never, ever mix tabs and spaces! Especially in Python!
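As a side note, the word-to-files mapping itself can also be built with collections.defaultdict(set), which makes the "only record a file number once per word" rule automatic. This is a minimal, self-contained sketch with made-up sample data; samples and wordFiles are illustrative names, not the poster's actual variables:

from collections import defaultdict

# hypothetical (file number, abstract) pairs standing in for the parsed files
samples = [
    ('a9500006', 'mutation is studied here'),
    ('a9500007', 'this is a second abstract'),
]

wordFiles = defaultdict(set)              # word -> set of file numbers
for fileno, abstract in samples:
    for word in abstract.split():
        wordFiles[word].add(fileno)       # a set ignores repeated additions

with open('test3.txt', 'w') as worddict:
    for word, files in wordFiles.items():
        worddict.write(word + '|' + ', '.join(sorted(files)) + '\n')

Sets do not keep insertion order, hence the sorted() call when writing each line out.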