Question

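I am trying to write a simple program where you can paste the text of an email, and the program prints which dict overlaps most with the email.

Here is a brief outline of what I am trying to achieve:

1. The user copies and pastes an email into the program
2. The text is stored in a variable
3. The variable is appended to a text file (building a simple database of all emails)
4. Every word in the variable is compared with the words in 4 different categories (dicts)
5. Every time a word in the email matches a word in a dict, a separate variable keeps track of it
6. Finally the program shows a prediction, i.e. for each of the 4 categories, how many of its words the email contains

So far my program saves the input as a single lowercase string and appends it to the text file.

So how do I iterate over every word in the latest entry of the text file and check which of the 4 dicts has the most matching words?

This is what I have so far.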

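content = ""
line = input(">")

# Read lines until the user types EOF
while line != "EOF":
    # Keep a space between lines so words do not run together
    content += line.lower() + " "
    line = input(">")

# Append the email to the database file; "with" closes the file afterwards
with open('Email_file.txt', 'a') as file:
    file.write('-- START --\n' + content.strip() + '\n-- FINISH --\n')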

list_1={
    'word1':1,
    'word2':1,
    'word4':1}

list_2={
    'word5':1,
    'word6':1,
    'word7':1}

list_3={
    'word8':1,
    'word9':1,
    'word10':1}

list_4={
    'word11':1,
    'word12':1,
    'word13':1}

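To give some background on how I want to use this: I receive a lot of emails that can usually be sorted into one of 4 types. Instead of classifying every email by hand, I want to write a program that predicts how likely each category is, based on the words in the email. Later I would like to add a small machine-learning part by asking the user whether the prediction was correct; if so, the numbers behind the words in the dicts are increased, and I can later turn those into a weight for each word. But that is all for later. For now I just want to compare the content of the email with the 4 lists and print which list has the most corresponding words.

---- UPDATE ----

When I try to run your code:

from collections import Counter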

a = Counter({
    'hello':1,
    'bye':1,
    'see you':1,})

b = Counter({
    'tomorrow':1,
    'today':1,
    'last week':1,})

c = Counter({
    'walk':1,
    'bike':1,
    'swim':1,
    'run':1,})

with open("emailfile.txt") as f:
    # for every line
    for line in f:
        # split line into words
        spl = line.split()
        # update count for each word set
        # a.keys() & spl finds any common words
        a.update(a.keys() & spl) # .viewkeys() for python2
        b.update(b.keys() & spl)
        c.update(c.keys() & spl)

# find word set with most occurrences
print(max((a, b, c), key=lambda x: sum(x.values())))

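and emailfile.txt contains:

Hello, last week I bought a bike and today I ride the bike. Then I walk.

It prints: Counter({'bike': 2, 'walk': 1, 'swim': 1, 'run': 1})

I don't understand what it does, because 'walk' stays at 1 even though it is in the file.

I would like it to print something like: Highest correspondence: C with 3 similar words

Thanks!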

Answer 1

Here is a rough idea of how to use collections.Counter dicts to count the words and find the word set that occurs most often in the file:

from collections import Counter
from string import punctuation

# Counter dicts will count the occurrences for each word set
a = Counter({
    'word1': 0,
    'word2': 0,
    'word4': 0})

b = Counter({
    'word5': 0,
    'word6': 0,
    'word7': 0})

c = Counter({
    'word8': 0,
    'word9': 0,
    'word10': 0})

d = Counter({
    'word11': 0,
    'word12': 0,
    'word13': 0})

with open("emailfile.txt") as f:
    # for every line
    for line in f:
        # split line into words, remove punctuation and lower
        spl = [w.strip(punctuation).lower() for w in line.split()]
        # update count for each word set
        # a.keys() & spl finds any common words
        a.update(a.keys() & spl) # .viewkeys() for python2
        b.update(b.keys() & spl)
        c.update(c.keys() & spl)
        d.update(d.keys() & spl)

# find word set with most occurrences
print(max((a, b, c, d), key=lambda x: sum(x.values())))
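
The snippet above prints the whole winning Counter. To get output in the form asked for in the update ("Highest correspondence: C with 3 similar words"), one option is to keep the Counters in a dict keyed by category name. This is only a sketch under assumptions: the names `best_category` and `categories` and the sample text are made up for illustration, and the text is passed in as a list of lines instead of being read from emailfile.txt. The counts start at 0 and punctuation is stripped, so "walk." matches 'walk' and unmatched words no longer add to the total.

```python
from collections import Counter
from string import punctuation

def best_category(lines, categories):
    """Return (name, matches) for the word set with the most matches.

    `categories` maps a category name to an iterable of words
    (hypothetical structure, not from the original post).
    """
    # Start every count at 0 so only real matches are counted
    counts = {name: Counter(dict.fromkeys(words, 0))
              for name, words in categories.items()}
    for line in lines:
        # split line into words, remove punctuation and lower
        spl = [w.strip(punctuation).lower() for w in line.split()]
        for cnt in counts.values():
            # the intersection is a set, so a word counts once per line
            cnt.update(cnt.keys() & spl)
    totals = {name: sum(cnt.values()) for name, cnt in counts.items()}
    name = max(totals, key=totals.get)
    return name, totals[name]

# Sample data, assumed for illustration
categories = {
    'A': ['hello', 'bye'],
    'B': ['tomorrow', 'today'],
    'C': ['walk', 'bike', 'swim', 'run'],
}
lines = ["Hello, last week I bought a bike and today I ride the bike.",
         "Then I walk."]
name, matches = best_category(lines, categories)
print("Highest correspondence: {} with {} similar words".format(name, matches))
```

Because the intersection is taken per line, a word that appears twice on the same line is counted once for that line. Multi-word entries such as 'see you' or 'last week' can never match here, since each line is split on whitespace.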

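Or store the words in separate lists and use a single Counter dict:

from collections import Counter
from itertools import chain
from string import punctuation

a = [...]
b = [...]
# etc...

with open("emailfile.txt") as f:
    # count every word in the file, stripped of punctuation and lowercased
    cn = Counter(chain.from_iterable(
        (w.strip(punctuation).lower() for w in line.split()) for line in f))

# find the word list whose words occur most often
print(max((a, b, c, d), key=lambda x: sum(cn[w] for w in x)))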
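
As a rough, self-contained illustration of the single-Counter variant: the word lists are elided above, so the lists and sample lines below are hypothetical, and the helper name `count_words` is made up for this sketch. It works on a list of lines rather than on emailfile.txt. Unlike the per-line set intersection, this version counts every occurrence of a word.

```python
from collections import Counter
from itertools import chain
from string import punctuation

def count_words(lines):
    """Count every word in `lines`, stripped of punctuation and lowercased."""
    return Counter(chain.from_iterable(
        (w.strip(punctuation).lower() for w in line.split()) for line in lines))

# Hypothetical word lists; the originals are elided in the answer
a = ['hello', 'bye']
b = ['tomorrow', 'today']
c = ['walk', 'bike', 'swim', 'run']

# Sample text, assumed for illustration
lines = ["Hello, last week I bought a bike and today I ride the bike.",
         "Then I walk."]
cn = count_words(lines)

# A Counter returns 0 for missing words, so cn[w] never raises KeyError
scores = {'a': sum(cn[w] for w in a),
          'b': sum(cn[w] for w in b),
          'c': sum(cn[w] for w in c)}
best = max(scores, key=scores.get)
print("Highest correspondence: {} with {} similar words".format(best, scores[best]))
```

Keeping the scores in a dict keyed by name makes it possible to print the name of the winning list together with its count, rather than the list itself.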
