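Calculating a monogram in Python

Time: 2015-05-08 05:06:29

Tags: python python-3.4

I have two text files. One is the full text (text1), and the other lists the unique words in text1. I need to compute a monogram (the relative frequency of each word) and write it to a file. This is what I have tried: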

import codecs

def countwords(mytext):
    # Count the whitespace-separated words in a UTF-8 text file.
    count = 0
    with codecs.open(mytext, 'r', 'utf_8') as file:
        for line in file:
            count += len(line.split())
    return count

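def CalculateMonoGram(path, lex, table):
    # table: an open, writable file object that receives the output.
    with open(path, 'r', encoding='utf_8') as fid:
        mypath = fid.read().split()
    with open(lex, 'r', encoding='utf_8') as fid1:
        mylex = fid1.read().split()
    total = countwords(lex)  # denominator, computed once instead of per word
    # Map each lexicon word to the number of times it occurs in the text.
    x = dict((word1, mypath.count(word1)) for word1 in mylex)
    for value in x.values():
        monogram = '\t' + str(value / total)
        table.write(monogram)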

1 answer:

Answer 0: (score: 3)

You can use collections.Counter and re.sub:

import re
import collections

pattern = "[^a-zA-Z]"
with open("input.txt") as f1, open("sub_input.txt") as f2:
    # Strip non-letter characters from every word in the text, then count occurrences.
    frequencies = collections.Counter(
        re.sub(pattern, "", word) for line in f1 for word in line.split()
    )
    # Look up the count of each word listed in the second file.
    print([frequencies[word] for line in f2 for word in line.split()])

The above prints [4, 2] for the following two files.

input.txt
asd,
asd. lkj lkj  sdf
sdf .asd  wqe qwe kl
dsf asd,. wqe

sub_input.txt

asd sdf

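In case the code is unclear, here is a breakdown:

  • collections.Counter(iterable) builds an unordered collection in which the elements of the iterable are the dictionary keys and the number of times each element occurs is the dictionary value.
  • The regex pattern [^a-zA-Z] matches any character outside the ranges a-z and A-Z. re.sub(pattern, substitute, string) replaces every substring of string matched by pattern with substitute. In this case, all non-letter characters are replaced with an empty string.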
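
To tie the breakdown back to the question, here is a minimal sketch that turns the counts into relative frequencies and writes them to a file. It assumes a monogram (unigram) value is a word's count divided by the total number of words in the text; the question does not spell out the definition. The helper names count_words, unigram_probabilities and write_unigrams, and the one-word-per-line tab-separated output layout, are illustrative choices and not part of the answer above.

```python
import collections
import re

NON_LETTERS = re.compile(r"[^a-zA-Z]")


def count_words(lines):
    """Count words after stripping non-letters; tokens that end up empty are skipped."""
    cleaned = (NON_LETTERS.sub("", word) for line in lines for word in line.split())
    return collections.Counter(word for word in cleaned if word)


def unigram_probabilities(lines, lexicon):
    """Assumed definition: count of the word / total number of words in the text."""
    counts = count_words(lines)
    total = sum(counts.values())
    if total == 0:
        return {word: 0.0 for word in lexicon}
    return {word: counts[word] / total for word in lexicon}


def write_unigrams(probabilities, out_path):
    """Write one 'word<TAB>probability' line per lexicon word (hypothetical layout)."""
    with open(out_path, "w", encoding="utf-8") as out:
        for word, prob in probabilities.items():
            out.write("{}\t{}\n".format(word, prob))


# Same sample data as input.txt and sub_input.txt above, kept in memory.
text = ["asd,", "asd. lkj lkj  sdf", "sdf .asd  wqe qwe kl", "dsf asd,. wqe"]
probs = unigram_probabilities(text, ["asd", "sdf"])
print(probs)
```

With real files, the lines can come straight from an open file object, for example unigram_probabilities(open("input.txt"), open("sub_input.txt").read().split()), and the result can be passed to write_unigrams together with an output path of your choice.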