Question

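My question is similar to this one, but slightly different. I am trying to read a file and find the lines that start with 'From' and contain an email address. I then want to build a dictionary that stores these emails and also report the email address that occurs most often.

The lines to look for in the file look like this:

From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008

Whenever such a line is found, the email part should be extracted and put in a list before the dictionary is created.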

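I came across this code sample for printing the key with the largest value in a dict:

counts = dict()
names = ['csev','owen','csev','zqian','cwen']
for name in names:
  counts[name] = counts.get(name,0) + 1
  maximum = max(counts, key = counts.get)
print maximum, counts[maximum]

Based on this sample, I wrote the program below. The problem is that only 27 lines start with "From ", and the most repeated email in that list should be 'cwen@iupui.edu', which occurs 5 times, but the program does not report that when I run it: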

import re

name = raw_input("Enter file:")
if len(name) < 1 : name = "mbox-short.txt"
handle = open(name)
matches = []
addy = []
counts = dict()

for lines in handle :
    # look for specific characters in document text
    if not lines.startswith("From ") : continue
    # increment the count variable for each math found
    lines.split()
    # append the required lines to the matches list
    matches.append(lines)
    # loop through the list to acess each line individually
    for email in matches :
        # place values in variable
        out = email
        # looking through each line for any email add found
        found = re.findall(r'[\w\.-]+@[\w\.-]+', out)
        # loop through the found emails and print them out
        for i in found :
            i.split()
            addy.append(i)
            for i in addy:
                counts[i] = counts.get(i, 0) + 1
                maximum = max(counts, key=counts.get)
    print counts
    print maximum, counts[maximum]

Here is a link to the text file, for a better understanding: text file

Answer 1

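You have a few problems.

The first is that the for email in matches loop runs once for every line in the text file. Move it out of the file-reading loop: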

for lines in handle :
    # look for specific characters in document text
    if not lines.startswith("From ") : continue
    # increment the count variable for each math found
    lines.split()
    # append the required lines to the matches list
    matches.append(lines)

# loop through the list to acess each line individually
for email in matches:

With this change, you iterate over matches only once.

Then, since we know there is only one email address in each match, we can change the lookup to:

found = re.findall(r'[\w\.-]+@[\w\.-]+', out)[0]
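
To see why indexing with [0] works here, it helps to look at what re.findall returns for one of these lines. A minimal sketch, written for Python 3 (unlike the Python 2 snippets in this thread) and using the sample line from the question; the name EMAIL_PATTERN is only illustrative:

```python
import re

# Same pattern as in the answer; the constant name is illustrative.
EMAIL_PATTERN = r'[\w\.-]+@[\w\.-]+'

line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# findall returns a list with every non-overlapping match in the line.
found = re.findall(EMAIL_PATTERN, line)
print(found)

# A "From " line carries a single address, so the first element is the one we want.
first = found[0]
print(first)
```

If a line contained no address at all, found would be an empty list and found[0] would raise IndexError, so this shortcut assumes every "From " line carries an address.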

The code that counts how many times we have seen each address changes from:

# loop through the found emails and print them out
for i in found :
    i.split()
    addy.append(i)
    for i in addy:
        counts[i] = counts.get(i, 0) + 1
        maximum = max(counts, key=counts.get)

to the more readable:

if found in counts:
    counts[found] += 1
else:
    counts[found] = 1

Then you can get the maximum once at the end, rather than updating it on every iteration:

print counts
print max(counts, key=counts.get)
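
One detail worth checking here is what the key function passed to max receives. Iterating over a dict yields its keys, so the function is called with each email string, not with a (key, count) pair. The sketch below is written for Python 3 and uses made-up addresses; it contrasts key=counts.get with an index-based key such as lambda x: x[1], which compares the second character of each address rather than its count.

```python
# Hypothetical data: 'ab@example.com' is clearly the most frequent address.
counts = {'ab@example.com': 3, 'az@example.com': 1}

# counts.get maps each key to its count, so max picks the highest count.
by_count = max(counts, key=counts.get)
print(by_count, counts[by_count])

# x is a key (a string), so x[1] is its second character: 'b' versus 'z'.
by_second_char = max(counts, key=lambda x: x[1])
print(by_second_char, counts[by_second_char])
```

With real data the two keys can happen to agree, which makes the index-based version easy to mistake for a correct one.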

Putting it all together gives you:

import re

name = raw_input("Enter file:")
if len(name) < 1 : 
    name = "mbox-short.txt"
handle = open(name)
matches = []
addy = []
counts = dict()

for lines in handle :
    # look for specific characters in document text
    if not lines.startswith("From ") : continue
    # increment the count variable for each math found
    lines.split()
    # append the required lines to the matches list
    matches.append(lines)

# loop through the list to acess each line individually
for email in matches:
    # place values in variable
    out = email
    # looking through each line for any email add found
    found = re.findall(r'[\w\.-]+@[\w\.-]+', out)[0]
    # loop through the found emails and print them out
    if found in counts:
        counts[found] += 1
    else:
        counts[found] = 1

print counts
print max(counts, key=counts.get)

Which returns:

{'gopal.ramasammycook@gmail.com': 1, 'louis@media.berkeley.edu': 3, 'cwen@iupui.edu': 5, 'antranig@caret.cam.ac.uk': 1, 'rjlowe@iupui.edu': 2, 'gsilver@umich.edu': 3, 'david.horwitz@uct.ac.za': 4, 'wagnermr@iupui.edu': 1, 'zqian@umich.edu': 4, 'stephen.marquard@uct.ac.za': 2, 'ray@media.berkeley.edu': 1}
cwen@iupui.edu

Answer 2

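lines.split() does not change lines, and the same is true of i.split(). Use print to check intermediate values like these.

Also check that your for loops do what you intend.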

import re
import collections

addy = []

with open("mbox-short.txt") as handle:
    for lines in handle :
        if not lines.startswith("From ") : continue
        found = re.search(r'[\w\.-]+@[\w\.-]+', lines).group()
        addy.append(found.split('@')[0])
print collections.Counter(addy).most_common(1)
# out: [('cwen', 5)]
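
The point about split() can be checked directly. A small sketch, written for Python 3 and using the sample line from the question:

```python
line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n"

# Calling split() and discarding the result leaves the string untouched,
# because Python strings are immutable.
line.split()
print(repr(line))

# To use the pieces, keep the list that split() returns.
words = line.split()
print(words[1])
```

Since the address is the second whitespace-separated field of a "From " line, words[1] is another way to extract it without a regular expression.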

Answer 3

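Your loops over matches and over found are not indented correctly. First you iterate over all the lines in the file and add every line that starts with "From " to matches. Only after that should you iterate over those matches. Similarly, for each matched line you add all the email addresses to addy, and only after that should you loop over that list. That is: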

for lines in handle :
    # look for specific characters in document text
    ...

for email in matches :
    ...

    for i in found :
        i.split()
        addy.append(i)

for i in addy:
    counts[i] = counts.get(i, 0) + 1
    maximum = max(counts, key=counts.get)
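
The effect of the indentation can be seen on a tiny example. The sketch below is written for Python 3 and uses made-up addresses; it contrasts counting inside the collecting loop with counting in a separate loop afterwards.

```python
# Hypothetical input: 'a@example.com' appears twice, 'b@example.com' once.
addresses = ['a@example.com', 'b@example.com', 'a@example.com']

# Nested version: the whole list so far is re-counted on every iteration.
addy = []
nested_counts = {}
for address in addresses:
    addy.append(address)
    for seen in addy:
        nested_counts[seen] = nested_counts.get(seen, 0) + 1

# Sequential version: collect first, then count each entry exactly once.
flat_counts = {}
for seen in addresses:
    flat_counts[seen] = flat_counts.get(seen, 0) + 1

print(nested_counts)  # inflated counts
print(flat_counts)    # true counts
```

The nested version counts earlier entries again each time a new one is appended, which is why the totals in the question's program grow far beyond the number of "From " lines.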

Answer 4

After reflecting further on my code, I arrived at an answer almost the same as @Noelkd's:

import re

name = raw_input("Enter file:")
if len(name) < 1 : name = "mbox-short.txt"
handle = open(name)

email_matches = []
found_emails = []
final_emails = []
counts = dict()

for lines in handle :
    # look for specific characters in document text
    if not lines.startswith("From ") : continue
    # increment the count variable for each math found
    lines.split()
    # append the required lines to the matches list
    email_matches.append(lines)

for email in email_matches :
    out = email
    found = re.findall(r'[\w\.-]+@[\w\.-]+',  out)
    found_emails.append(found)

for item in found_emails :
    count = item[0]
    final_emails.append(count)

for items in final_emails:
    counts[items] = counts.get(items,0) + 1
    maximum = max(counts, key = lambda x: counts.get(x))
print maximum, counts[maximum]

Output:

cwen@iupui.edu 5
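
For comparison, the separate lists above can be folded into a single pass. This is only a sketch under a few assumptions: it is written for Python 3, the function name most_common_sender is made up for illustration, and the sample lines are invented in the style of the mbox file rather than copied from it. It uses collections.Counter, which another answer in this thread also relies on.

```python
import re
from collections import Counter

# Same pattern as in the thread, compiled once; the name is illustrative.
EMAIL_PATTERN = re.compile(r'[\w\.-]+@[\w\.-]+')

def most_common_sender(lines):
    """Return (address, count) for the most frequent "From " sender, or None."""
    counts = Counter()
    for line in lines:
        # Only envelope lines start with "From " (note the trailing space).
        if not line.startswith("From "):
            continue
        match = EMAIL_PATTERN.search(line)
        if match:
            counts[match.group()] += 1
    if not counts:
        return None
    return counts.most_common(1)[0]

# Invented sample lines; a real run would pass an open file handle instead.
sample = [
    "From cwen@iupui.edu Fri Jan  4 11:37:30 2008\n",
    "From: cwen@iupui.edu\n",  # header line, skipped by the "From " check
    "From zqian@umich.edu Fri Jan  4 16:10:39 2008\n",
    "From cwen@iupui.edu Fri Jan  4 11:35:08 2008\n",
]
result = most_common_sender(sample)
print(result)
```

Because a file object is iterable line by line, calling most_common_sender(open("mbox-short.txt")) would be the equivalent of the program above.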

Using a dictionary to find the most frequently occurring word in a text file
