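My question is similar to this one, but slightly different. I am trying to read a file and find the lines that start with 'From ' and contain an email address, then build a dictionary that counts these addresses and report the one that occurs most often.
The lines to look for in the file look like this:
From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
Whenever such a line is found, the email part should be extracted and placed in a list before the dictionary is created.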
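I came across this code sample for printing the key with the largest value in a dict:
counts = dict()
names = ['csev','owen','csev','zqian','cwen']
for name in names:
    counts[name] = counts.get(name,0) + 1
maximum = max(counts, key = counts.get)
print maximum, counts[maximum]
Starting from this sample, I wrote the program below. The problem is that only 27 lines start with 'From ', and the most frequent address among them should be 'cwen@iupui.edu', which occurs 5 times, but that is not the output I get when I run my code: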
import re
name = raw_input("Enter file:")
if len(name) < 1 : name = "mbox-short.txt"
handle = open(name)
matches = []
addy = []
counts = dict()
for lines in handle :
    # look for specific characters in document text
    if not lines.startswith("From ") : continue
    # increment the count variable for each match found
    lines.split()
    # append the required lines to the matches list
    matches.append(lines)
    # loop through the list to access each line individually
    for email in matches :
        # place values in variable
        out = email
        # looking through each line for any email address found
        found = re.findall(r'[\w\.-]+@[\w\.-]+', out)
        # loop through the found emails and print them out
        for i in found :
            i.split()
            addy.append(i)
            for i in addy:
                counts[i] = counts.get(i, 0) + 1
                maximum = max(counts, key=counts.get)
print counts
print maximum, counts[maximum]
Here is a link to the text file, for a better understanding: text file
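Answer 0 (score: 1)
You have a few problems.
The first is that the "for email in matches" loop runs once for every line in the text file, because it is nested inside the "for lines in handle" loop. Dedent it so that it runs only after the whole file has been read:
for lines in handle :
    # look for specific characters in document text
    if not lines.startswith("From ") : continue
    # increment the count variable for each match found
    lines.split()
    # append the required lines to the matches list
    matches.append(lines)
# loop through the list to access each line individually
for email in matches:
With this change, you iterate over matches only once.
Then, since we know there is only one address in each match, we can change the lookup to: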
found = re.findall(r'[\w\.-]+@[\w\.-]+', out)[0]
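A minimal sketch of what the [0] does, written in Python 3 syntax (the post itself uses Python 2). re.findall returns a list of every match in the string, so indexing with [0] picks the first one. The helper name first_address and the second test string are illustrative assumptions, not part of the original code.

```python
import re

EMAIL_PATTERN = r'[\w\.-]+@[\w\.-]+'

def first_address(line):
    """Return the first email-like token in line, or None if there is none."""
    found = re.findall(EMAIL_PATTERN, line)  # always a list, possibly empty
    return found[0] if found else None

sample = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"
print(re.findall(EMAIL_PATTERN, sample))  # ['stephen.marquard@uct.ac.za']
print(first_address(sample))              # stephen.marquard@uct.ac.za
print(first_address("From nobody"))       # None
```

Indexing an empty list raises IndexError, so a guard like the one above matters if a line might carry no address; the answer assumes every 'From ' line has exactly one.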
To count how many times each address has been seen, change:
# loop through the found emails and print them out
for i in found :
    i.split()
    addy.append(i)
    for i in addy:
        counts[i] = counts.get(i, 0) + 1
        maximum = max(counts, key=counts.get)
to the more readable:
if found in counts:
    counts[found] += 1
else:
    counts[found] = 1
Then you can get the maximum once at the end, instead of recomputing it on every iteration:
print counts
print max(counts, key=counts.get)
Putting it all together:
import re
name = raw_input("Enter file:")
if len(name) < 1 :
    name = "mbox-short.txt"
handle = open(name)
matches = []
addy = []
counts = dict()
for lines in handle :
    # look for specific characters in document text
    if not lines.startswith("From ") : continue
    # increment the count variable for each match found
    lines.split()
    # append the required lines to the matches list
    matches.append(lines)
# loop through the list to access each line individually
for email in matches:
    # place values in variable
    out = email
    # take the first (and only) email address on the line
    found = re.findall(r'[\w\.-]+@[\w\.-]+', out)[0]
    # count this address
    if found in counts:
        counts[found] += 1
    else:
        counts[found] = 1
print counts
print max(counts, key=counts.get)
This returns:
{'gopal.ramasammycook@gmail.com': 1, 'louis@media.berkeley.edu': 3, 'cwen@iupui.edu': 5, 'antranig@caret.cam.ac.uk': 1, 'rjlowe@iupui.edu': 2, 'gsilver@umich.edu': 3, 'david.horwitz@uct.ac.za': 4, 'wagnermr@iupui.edu': 1, 'zqian@umich.edu': 4, 'stephen.marquard@uct.ac.za': 2, 'ray@media.berkeley.edu': 1}
cwen@iupui.edu
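For reference, a small Python 3 sketch of the counting and lookup steps on an in-memory list. The addresses are made-up sample data and most_frequent is a hypothetical helper name. Iterating a dict yields its keys, so max needs a key function that maps each key to its count, and counts.get does that.

```python
def most_frequent(addresses):
    """Return (address, count) for the address that appears most often."""
    counts = {}
    for address in addresses:
        if address in counts:
            counts[address] += 1
        else:
            counts[address] = 1
    # max() walks the keys; key=counts.get ranks each key by its stored count
    winner = max(counts, key=counts.get)
    return winner, counts[winner]

sample = ["zoe@example.com", "amy@example.com", "amy@example.com", "bob@example.com"]
print(most_frequent(sample))  # ('amy@example.com', 2)
```

A key such as lambda x: x[1] would compare the second character of each address rather than its count, so it would only give the right answer by accident.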
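Answer 1 (score: 1)
lines.split() does not change lines, and the same goes for i.split(): a string method returns a new value, which is discarded unless you assign it. Use print to inspect these temporary values, and check that the for loops do what you intend.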
import re
import collections
addy = []
with open("mbox-short.txt") as handle:
    for lines in handle :
        if not lines.startswith("From ") : continue
        found = re.search(r'[\w\.-]+@[\w\.-]+', lines).group()
        addy.append(found.split('@')[0])
print collections.Counter(addy).most_common(1)
# out: [('cwen', 5)]
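The point about split() can be checked directly. This is a minimal Python 3 sketch (the post's code is Python 2) using the sample line from the question: strings are immutable, so split() returns a new list and the original string is untouched unless the result is assigned.

```python
line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

line.split()                 # result is discarded; line is unchanged
print(type(line).__name__)   # str

words = line.split()         # keep the result by assigning it
print(words[1])              # stephen.marquard@uct.ac.za
```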
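Answer 2 (score: 0)
Your loops over matches and over found are not indented correctly. First, iterate over all lines in the file and add every line that starts with "From " to matches. Only after that should you iterate over these matches. Similarly, for the matching lines you add all email addresses to addy, and only after that should you loop through this list. That is: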
for lines in handle :
    # look for specific characters in document text
    ...
for email in matches :
    ...
    for i in found :
        i.split()
        addy.append(i)
for i in addy:
    counts[i] = counts.get(i, 0) + 1
maximum = max(counts, key=counts.get)
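The structure described above, as a runnable Python 3 sketch. It reads from an in-memory list instead of mbox-short.txt so that it is self-contained; the sample lines and the function name count_senders are assumptions for illustration.

```python
import re

def count_senders(lines):
    """Count addresses on lines starting with 'From ', one phase per loop."""
    # Phase 1: collect the matching lines
    matches = [line for line in lines if line.startswith("From ")]

    # Phase 2: pull the addresses out of the matching lines
    addy = []
    for line in matches:
        addy.extend(re.findall(r'[\w\.-]+@[\w\.-]+', line))

    # Phase 3: tally each collected address
    counts = {}
    for address in addy:
        counts[address] = counts.get(address, 0) + 1
    return counts

sample_lines = [
    "From amy@example.com Sat Jan  5 09:14:16 2008",
    "From: amy@example.com",  # header line, skipped: no space after 'From'
    "From bob@example.com Sat Jan  5 09:15:00 2008",
    "From amy@example.com Sat Jan  5 09:16:00 2008",
]
counts = count_senders(sample_lines)
maximum = max(counts, key=counts.get)
print(maximum, counts[maximum])  # amy@example.com 2
```

Because each loop finishes before the next one starts, every address is tallied exactly once per occurrence, which is what a nested version gets wrong.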
Answer 3 (score: 0)
After reflecting further on my code, I arrived at an answer very similar to @Noelkd's:
import re
name = raw_input("Enter file:")
if len(name) < 1 : name = "mbox-short.txt"
handle = open(name)
email_matches = []
found_emails = []
final_emails = []
counts = dict()
for lines in handle :
    # look for specific characters in document text
    if not lines.startswith("From ") : continue
    # increment the count variable for each match found
    lines.split()
    # append the required lines to the matches list
    email_matches.append(lines)
for email in email_matches :
    out = email
    found = re.findall(r'[\w\.-]+@[\w\.-]+', out)
    found_emails.append(found)
for item in found_emails :
    count = item[0]
    final_emails.append(count)
for items in final_emails:
    counts[items] = counts.get(items,0) + 1
maximum = max(counts, key = lambda x: counts.get(x))
print maximum, counts[maximum]
Output:
cwen@iupui.edu 5