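I have two text files. One contains the entire text (text1), and the other lists the unique words that occur in text1. I need to compute a monogram for these words and then write it to a file. This is what I have tried: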
def countwords(mytext):
    import codecs
    file = codecs.open(mytext, 'r', 'utf_8')
    count = 0
    mytext = file.readlines()
    for line in mytext:
        words = line.split()
        for word in words:
            count = count + 1
    file.close()
    return(count)

def CalculateMonoGram(path, lex):
    fid = open(path, 'r', encoding='utf_8')
    mypath = fid.read().split()
    fid1 = open(lex, 'r', encoding='utf_8')
    mylex = fid1.read().split()
    for word1 in mylex:
        if word1 in mypath:
            x = dict((word1, mypath.count(word1)) for word1 in mylex)
            for value in x:
                monogram = '\t' + str(value / countwords(lex))
                table.write(monogram)
Answer 0 (score: 3)
import collections
import re

# Any character that is not an ASCII letter.
pattern = "[^a-zA-Z]"
with open("input.txt") as f1, open("sub_input.txt") as f2:
    # Count every word of input.txt after stripping non-letter characters.
    frequencies = collections.Counter(
        re.sub(pattern, "", word) for line in f1 for word in line.split()
    )
    # Look up the count of each word listed in sub_input.txt.
    print([frequencies[word] for line in f2 for word in line.split()])
The code above prints [4, 2] for the following input.txt:
asd,
asd. lkj lkj sdf
sdf .asd wqe qwe kl
dsf asd,. wqe
and sub_input.txt:
asd sdf
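To check the [4, 2] result without creating files, the same counting can be run on in-memory stand-ins. This is only a sketch: the helper name count_listed_words and the two string constants are made up for illustration, and io.StringIO takes the place of the real input.txt and sub_input.txt.

```python
import collections
import io
import re

# Stand-ins for input.txt and sub_input.txt (same content as the samples).
INPUT_TXT = "asd,\nasd. lkj lkj sdf\nsdf .asd wqe qwe kl\ndsf asd,. wqe\n"
SUB_INPUT_TXT = "asd sdf\n"

# Any character that is not an ASCII letter.
NON_LETTERS = re.compile(r"[^a-zA-Z]")


def count_listed_words(text_file, words_file):
    """Return, for each word in words_file, how often it occurs in text_file."""
    frequencies = collections.Counter(
        NON_LETTERS.sub("", token)
        for line in text_file
        for token in line.split()
    )
    # A Counter returns 0 for words it has never seen, so no KeyError here.
    return [frequencies[word] for line in words_file for word in line.split()]


counts = count_listed_words(io.StringIO(INPUT_TXT), io.StringIO(SUB_INPUT_TXT))
print(counts)  # [4, 2]
```

Passing file-like objects instead of paths keeps the function easy to try out; with real files, the two arguments would be the handles returned by open().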
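collections.Counter(iterable) builds a dict subclass in which the elements of the iterable are stored as dictionary keys and the number of times each one occurs is stored as the dictionary value. [^a-zA-Z] matches any character that is not in the range a-z or A-Z. re.sub(pattern, substitute, string) replaces every substring of string that matches pattern with substitute. In this case, it replaces all non-letter characters with the empty string.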
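The counts can be turned into the monogram values the question asks about. The sketch below assumes that "monogram" means the relative frequency of a word, that is, its count divided by the total number of words in the text; that definition, the function names monogram_probabilities and format_table, and the tab-separated layout are all assumptions, not something the question or the answer specifies.

```python
import collections
import re

# re.sub removes every non-letter character from a single token.
cleaned = re.sub("[^a-zA-Z]", "", "asd,.")  # "asd"


def monogram_probabilities(text, lexicon):
    """Map each lexicon word to count(word) / total number of words in text."""
    # Strip non-letters from every whitespace-separated token.
    tokens = [re.sub("[^a-zA-Z]", "", raw) for raw in text.split()]
    # Tokens made only of punctuation become empty strings; drop them.
    tokens = [token for token in tokens if token]
    counts = collections.Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {word: 0.0 for word in lexicon}
    return {word: counts[word] / total for word in lexicon}


def format_table(probabilities):
    """One 'word<TAB>probability' line per word (layout is an assumption)."""
    return "".join(f"{word}\t{p:.4f}\n" for word, p in probabilities.items())


sample = "asd, asd. lkj lkj sdf"
probs = monogram_probabilities(sample, ["asd", "sdf", "zzz"])
print(format_table(probs), end="")
```

The string returned by format_table can be written out with an ordinary open(path, "w", encoding="utf-8") handle. Counting once with a Counter also avoids calling list.count for every lexicon word, which rescans the whole text each time.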