Finding and counting the frequency of known word pairs in multiple files

Time: 2013-06-09 12:39:33

Tags: python string python-3.x sequence

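Basically, I need to count word pairs across multiple files. I have a list of word pairs in a file named result.txt, like this:

  1. the by
  2. they are
  3. grouping their

I want to check how often these pairs occur in a number of text files located in a given directory, and print the pairs and their frequencies in descending order. The output must look like this:

  1. grouping their 205
  2. they are 180
  3. of 56

I have tried the following: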

    import os
    import re
    from collections import Counter
    from glob import iglob
    from collections import defaultdict
    import itertools as it

    folderpath = 'path/to/directory'
    pairs=defaultdict(int)

    logfile = open('result.txt', 'r')
    loglist = logfile.readlines()
    logfile.close()
    found = False
    for line in loglist:
        for filepath in iglob(os.path.join(folderpath,'*.txt')):
            with open(filepath,'r') as filehandle:
                for pair in it.combinations(re.findall('\w+',line),2):
                    pairs[tuple(pair)]+=1
        found=True
    resultList=[pair+(occurences, ) for pair, occurences in pairs.iterkeys()]
      

But it does not give me the expected results. I would appreciate any help!

1 answer:

Answer 0 (score: 0)

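When you use combinations(), you get all pairs, including non-adjacent ones. You can write a function that returns only adjacent pairs instead. I tried the following code and it works; maybe it will give you some insight: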

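import os
import re
from collections import Counter

def pairs(text):
    # Return the adjacent word pairs in text as tuples.
    words = re.findall(r'[A-Za-z]+', text)
    return (tuple(words[i:i+2]) for i in range(len(words) - 1))

# The last two tokens of each line are the pair; a set makes lookups fast.
with open('result.txt') as pairfile:
    mypairs = {tuple(line.split()[-2:]) for line in pairfile if line.strip()}

c = Counter()
folderpath = 'path/to/directory'
for dirpath, dnames, fnames in os.walk(folderpath):
    for f in fnames:
        if not f.endswith('.txt'):
            continue
        with open(os.path.join(dirpath, f)) as textfile:
            for line in textfile:
                c.update(p for p in pairs(line) if p in mypairs)

# most_common() yields (pair, count) tuples sorted by descending count.
for item in c.most_common():
    print(item)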
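
To make the difference between combinations() and adjacent pairs concrete, here is a minimal, self-contained sketch that works on in-memory strings instead of files. The sample sentences, the `known` set, the lowercasing step, and the helper names `adjacent_pairs` and `format_counts` are all assumptions made for illustration; they are not part of the question or of the code above.

```python
import re
from collections import Counter
from itertools import combinations

def adjacent_pairs(text):
    """Return pairs of neighbouring words in text (lowercased, an assumption)."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    return zip(words, words[1:])

def format_counts(counts):
    """Render a Counter of pairs as 'word1 word2 count' lines, highest count first."""
    return ["{} {} {}".format(a, b, n) for (a, b), n in counts.most_common()]

sample = "they are of the group"  # hypothetical sample line
words = sample.split()

# combinations() pairs every word with every later word: 5 words give 10 pairs.
all_pairs = list(combinations(words, 2))
# Only neighbouring words: 5 words give 4 pairs.
neighbours = list(adjacent_pairs(sample))

known = {("they", "are"), ("of", "the")}  # hypothetical known pairs
text = "They are here. They are of the same group."  # hypothetical file content
counts = Counter(p for p in adjacent_pairs(text) if p in known)

for line in format_counts(counts):
    print(line)
```

With combinations(), a non-adjacent pair such as ("they", "the") is produced as well, which inflates the counts; the adjacent-pair version only yields words that actually sit next to each other. Note that this sketch, like the regex in the answer, ignores punctuation, so a pair can span a sentence boundary.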