Finding the most frequently occurring word in a text file using a dictionary

Time: 2016-08-12 09:24:52

Tags: python dictionary

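My question is similar to this one, but slightly different. I am trying to read through a file, looking for lines that start with 'From ' and contain an email address. I then want to build a dictionary that stores these addresses and also report the address that occurs most often.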

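The line to look for in the file is:

From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008

Whenever such a line is found, the email address should be extracted from it and placed in a list before the dictionary is created.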
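A minimal sketch of that extraction step, assuming the address is always the second whitespace-separated field of a "From " line, as in the sample line above. The helper name `extract_sender` is made up for illustration, and the code is written for Python 3 rather than the Python 2 used elsewhere in this thread.

```python
def extract_sender(line):
    """Return the address from an mbox-style 'From ' line, or None."""
    # Only envelope lines start with "From " (note the trailing space);
    # header lines such as "From: ..." are skipped.
    if not line.startswith("From "):
        return None
    fields = line.split()
    # Assumption: the address is the second whitespace-separated field.
    return fields[1] if len(fields) > 1 else None


sample = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"
print(extract_sender(sample))                              # stephen.marquard@uct.ac.za
print(extract_sender("From: stephen.marquard@uct.ac.za"))  # None
```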

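I came across this code sample for printing the key with the largest value in a dict:

counts = dict()
names = ['csev','owen','csev','zqian','cwen']
for name in names:
  counts[name] = counts.get(name,0) + 1
  maximum = max(counts, key = counts.get)
print maximum, counts[maximum]

Starting from this sample, I wrote the program below. The problem is that only 27 lines in the file start with "From ", and the most frequent address among them should be 'cwen@iupui.edu', which occurs 5 times, but that is not what I get when I run my code: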

import re

name = raw_input("Enter file:")
if len(name) < 1 : name = "mbox-short.txt"
handle = open(name)
matches = []
addy = []
counts = dict()

for lines in handle :
    # look for specific characters in document text
    if not lines.startswith("From ") : continue
    # increment the count variable for each match found
    lines.split()
    # append the required lines to the matches list
    matches.append(lines)
    # loop through the list to access each line individually
    for email in matches :
        # place values in variable
        out = email
        # looking through each line for any email add found
        found = re.findall(r'[\w\.-]+@[\w\.-]+', out)
        # loop through the found emails and print them out
        for i in found :
            i.split()
            addy.append(i)
            for i in addy:
                counts[i] = counts.get(i, 0) + 1
                maximum = max(counts, key=counts.get)
    print counts
    print maximum, counts[maximum]

Here is a link to the text file, for better understanding: text file

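4 answers:

Answer 0: (score: 1)

You have a few problems.

The first is that the for email in matches loop is run once for every line in the text file. Move it out of the first loop: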

for lines in handle :
    # look for specific characters in document text
    if not lines.startswith("From ") : continue
    # increment the count variable for each match found
    lines.split()
    # append the required lines to the matches list
    matches.append(lines)

# loop through the list to access each line individually
for email in matches:

With this change, you iterate over matches only once.
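To see why the nesting matters, here is a small sketch on made-up lines (the example.com addresses are sample data, not from the real file). `count_nested` mimics the structure of the question's program and `count_flat` counts each matching line once; both function names are hypothetical, and the code is written for Python 3.

```python
def count_nested(lines):
    """Mimic the question's structure: re-scan every list on every line."""
    matches, addy, counts = [], [], {}
    for line in lines:
        if not line.startswith("From "):
            continue
        matches.append(line)
        for email in matches:          # re-reads all earlier matches
            addy.append(email.split()[1])
            for a in addy:             # re-counts all earlier addresses
                counts[a] = counts.get(a, 0) + 1
    return counts


def count_flat(lines):
    """Count each matching line exactly once."""
    counts = {}
    for line in lines:
        if line.startswith("From "):
            address = line.split()[1]
            counts[address] = counts.get(address, 0) + 1
    return counts


lines = [
    "From a@example.com Sat Jan  5 09:14:16 2008",
    "Subject: hello",
    "From b@example.com Sat Jan  5 09:15:00 2008",
    "From a@example.com Sat Jan  5 09:16:00 2008",
]
print(count_flat(lines))    # {'a@example.com': 2, 'b@example.com': 1}
print(count_nested(lines))  # counts are inflated well beyond 2 and 1
```

The nested version grows both lists on every pass and counts their whole contents again, which is why the totals bear no relation to the number of "From " lines.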

Then, since we know there is only one address in each match, we can change the lookup to:

found = re.findall(r'[\w\.-]+@[\w\.-]+', out)[0]
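A quick sketch of what the trailing [0] does: `re.findall` returns a list of all matches, so indexing with [0] takes the first one. This assumes every matched line contains at least one address; on a line with none, [0] raises IndexError, so a guarded variant is shown as well. The sample lines are illustrative and the code is Python 3.

```python
import re

EMAIL_PATTERN = r'[\w\.-]+@[\w\.-]+'  # same pattern as used in this thread

line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

all_found = re.findall(EMAIL_PATTERN, line)
print(all_found)      # ['stephen.marquard@uct.ac.za']
print(all_found[0])   # stephen.marquard@uct.ac.za

# Guarded variant for lines that might not contain an address.
no_address = re.findall(EMAIL_PATTERN, "From nobody in particular")
first = no_address[0] if no_address else None
print(first)          # None
```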

To count how many times we have seen each address, change this:

# loop through the found emails and print them out
for i in found :
    i.split()
    addy.append(i)
    for i in addy:
        counts[i] = counts.get(i, 0) + 1
        maximum = max(counts, key=counts.get)

to the more readable:

if found in counts:
    counts[found] += 1
else:
    counts[found] = 1

Then you can get the maximum once at the end, instead of recomputing it on every iteration:

print counts
print max(counts, key=counts.get)
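A small sketch of picking the most frequent key from a counts dictionary, using made-up counts and Python 3 syntax. Passing `key=counts.get`, as in the sample code from the question, makes `max` compare the stored counts. A key function that indexes into the key string, such as `lambda x: x[1]`, compares characters of the address instead, so it agrees with the counts only by coincidence.

```python
counts = {"zqian@umich.edu": 4, "cwen@iupui.edu": 5, "ray@media.berkeley.edu": 1}

# Compare keys by their stored count.
most_common = max(counts, key=counts.get)
print(most_common, counts[most_common])  # cwen@iupui.edu 5

# Equivalent: compare (key, count) pairs by the count.
address, count = max(counts.items(), key=lambda item: item[1])
print(address, count)                    # cwen@iupui.edu 5

# Comparing by the second character of each key ignores the counts entirely.
toy = {"ab@example.com": 1, "az@example.com": 0, "ac@example.com": 9}
by_second_char = max(toy, key=lambda x: x[1])
print(by_second_char)                    # az@example.com
```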

Putting it all together gives you:

import re

name = raw_input("Enter file:")
if len(name) < 1 : 
    name = "mbox-short.txt"
handle = open(name)
matches = []
addy = []
counts = dict()

for lines in handle :
    # keep only the lines that start with "From "
    if not lines.startswith("From ") : continue
    # append the matching line to the matches list
    matches.append(lines)

# loop through the list to access each line individually
for email in matches:
    # place values in variable
    out = email
    # take the first email address found on the line
    found = re.findall(r'[\w\.-]+@[\w\.-]+', out)[0]
    # count the address
    if found in counts:
        counts[found] += 1
    else:
        counts[found] = 1

print counts
# pick the key with the highest count
print max(counts, key=counts.get)

Which returns:

{'gopal.ramasammycook@gmail.com': 1, 'louis@media.berkeley.edu': 3, 'cwen@iupui.edu': 5, 'antranig@caret.cam.ac.uk': 1, 'rjlowe@iupui.edu': 2, 'gsilver@umich.edu': 3, 'david.horwitz@uct.ac.za': 4, 'wagnermr@iupui.edu': 1, 'zqian@umich.edu': 4, 'stephen.marquard@uct.ac.za': 2, 'ray@media.berkeley.edu': 1}
cwen@iupui.edu

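Answer 1: (score: 1)

  1. lines.split() does not change lines, and the same is true of i.split(); use print to check these temporary values.
  2. Check that your for loops execute the way you intend.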
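A minimal sketch of the first point, on a made-up sample line: strings are immutable, so `str.split()` returns a new list and leaves the original string untouched unless the result is assigned to a name.

```python
line = "From a@example.com Sat Jan  5 09:14:16 2008"

line.split()             # result is discarded; line is unchanged
print(type(line))        # <class 'str'>

fields = line.split()    # keep the result by assigning it
print(fields[1])         # a@example.com
```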

    import re
    import collections
    
    addy = []
    
    with open("mbox-short.txt") as handle:
        for lines in handle :
            if not lines.startswith("From ") : continue
            found = re.search(r'[\w\.-]+@[\w\.-]+', lines).group()
            addy.append(found.split('@')[0])
    print collections.Counter(addy).most_common(1)
    # out: [('cwen', 5)]
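A variant sketch that keeps the full address instead of only the part before the @, run on a few made-up lines rather than the real file. It is written for Python 3; `collections.Counter` and `most_common` are standard library, and everything else here is sample data.

```python
import re
from collections import Counter

lines = [
    "From a@example.com Sat Jan  5 09:14:16 2008",
    "From: a@example.com",
    "From b@example.com Sat Jan  5 09:15:00 2008",
    "From a@example.com Sat Jan  5 09:16:00 2008",
]

addresses = []
for line in lines:
    if not line.startswith("From "):
        continue
    match = re.search(r'[\w\.-]+@[\w\.-]+', line)
    if match:  # skip lines without an address instead of raising AttributeError
        addresses.append(match.group())

print(Counter(addresses).most_common(1))  # [('a@example.com', 2)]
```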
    

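Answer 2: (score: 0)

Your loops over matches and over found are not indented correctly. First, iterate over all lines in the file and add every line that starts with "From" to matches. Only after that should you iterate over those matches. Similarly, for each matched line you add all of its email addresses to addy, and only after that should you iterate over that list, i.e.,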

for lines in handle :
    # look for specific characters in document text
    ...

for email in matches :
    ...

    for i in found :
        i.split()
        addy.append(i)

# count once, after addy has been fully built
for i in addy:
    counts[i] = counts.get(i, 0) + 1
    maximum = max(counts, key=counts.get)

Answer 3: (score: 0)

After reflecting further on my code, I arrived at an answer very similar to @Noelkd's:

import re

name = raw_input("Enter file:")
if len(name) < 1 : name = "mbox-short.txt"
handle = open(name)

email_matches = []
found_emails = []
final_emails = []
counts = dict()

for lines in handle :
    # look for specific characters in document text
    if not lines.startswith("From ") : continue
    # increment the count variable for each match found
    lines.split()
    # append the required lines to the matches list
    email_matches.append(lines)

for email in email_matches :
    out = email
    found = re.findall(r'[\w\.-]+@[\w\.-]+',  out)
    found_emails.append(found)

for item in found_emails :
    count = item[0]
    final_emails.append(count)

for items in final_emails:
    counts[items] = counts.get(items,0) + 1
    maximum = max(counts, key = lambda x: counts.get(x))
print maximum, counts[maximum]

Output:

cwen@iupui.edu 5
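For readers on Python 3, here is a hedged single-pass sketch of the same idea. The function name `most_common_sender` is hypothetical, and it accepts any iterable of lines so it can be tried on an in-memory stand-in for the file; the example.com lines are made-up sample data.

```python
import io


def most_common_sender(lines):
    """Return (address, count) for the most frequent 'From ' sender, or None."""
    counts = {}
    for line in lines:
        if not line.startswith("From "):
            continue
        fields = line.split()
        if len(fields) < 2:        # a bare "From " line carries no address
            continue
        counts[fields[1]] = counts.get(fields[1], 0) + 1
    if not counts:                 # max() would raise on an empty dict
        return None
    address = max(counts, key=counts.get)
    return address, counts[address]


fake_file = io.StringIO(
    "From a@example.com Sat Jan  5 09:14:16 2008\n"
    "From: a@example.com\n"
    "From b@example.com Sat Jan  5 09:15:00 2008\n"
    "From a@example.com Sat Jan  5 09:16:00 2008\n"
)
result = most_common_sender(fake_file)
print(result)  # ('a@example.com', 2)
```

With the real data, an open file object such as `open("mbox-short.txt")` can be passed in place of `fake_file`.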