Question

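[Task]

Write a program to read through a text file and figure out the distribution by hour of the day for each of the messages. You can pull the hour out of the 'From ' line by finding the time and then splitting the string a second time using a colon.

Example of a line in the text file:

"From lauren.marquard@oul.ab.bc Fri Jan 5 09:14:16 2015"

Once you have accumulated the counts for each hour, print out the counts, sorted by hour, as shown below.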

[Expected result]

This means I need to pull out the "09:14:16" part, and then split it once more to pull out the hour, "09".

I will use '#' comments to explain what I have done below.

[My code]

name = raw_input("Enter file:")
if len(name) < 1 : name = "mbox-short.txt"     #if nothing is entered by user, it goes straight to the desired file
handle = open(name, 'r')     # open and read the file
count = dict()     # initialise count to an empty dictionary
for text in handle:     #for loop to loop through lines in the file
    text = text.rstrip()     # rstrip() to remove the trailing newline "\n"
    if not text.startswith('From '): continue     # find lines that starts with "From "
    text = text.split()         #split the line into list of words
    line = text[5]              #time is located at the [5] index
    time = line.split(':')     #split once more to get the hour 
    hour = time[0]            #hour is on the [0] index    
    count[hour] = count.get(hour, 0) + 1
    print count

[My result]

{'09': 1}
{'09': 1, '18': 1}
{'09': 1, '18': 1, '16': 1}
{'09': 1, '18': 1, '16': 1, '15': 1}
{'09': 1, '18': 1, '16': 1, '15': 2}
{'09': 1, '18': 1, '16': 1, '15': 2, '14': 1}
{'09': 1, '18': 1, '16': 1, '15': 2, '14': 1, '11': 1}
{'09': 1, '18': 1, '16': 1, '15': 2, '14': 1, '11': 2}
{'09': 1, '18': 1, '16': 1, '15': 2, '14': 1, '11': 3}
(deleted portion of the result)
{'09': 2, '18': 1, '16': 1, '15': 2, '14': 1, '11': 6, '10': 3, '07': 1, '06': 1, '04': 3, '19': 1}
{'09': 2, '18': 1, '16': 1, '15': 2, '14': 1, '11': 6, '10': 3, '07': 1, '06': 1, '04': 3, '19': 1, '17': 1}
{'09': 2, '18': 1, '16': 1, '15': 2, '14': 1, '11': 6, '10': 3, '07': 1, '06': 1, '04': 3, '19': 1, '17': 2}
{'09': 2, '18': 1, '16': 2, '15': 2, '14': 1, '11': 6, '10': 3, '07': 1, '06': 1, '04': 3, '19': 1, '17': 2}
{'09': 2, '18': 1, '16': 3, '15': 2, '14': 1, '11': 6, '10': 3, '07': 1, '06': 1, '04': 3, '19': 1, '17': 2}
{'09': 2, '18': 1, '16': 4, '15': 2, '14': 1, '11': 6, '10': 3, '07': 1, '06': 1, '04': 3, '19': 1, '17': 2}

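Can someone help me see where I went wrong? Am I heading in the right direction? I appreciate any feedback and suggestions. I am quite new to programming, so please be gentle, and sorry for any formatting errors.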

Answer 1

Remove print count from the loop body, then add the following lines after the loop has finished (outside its body):

for key in sorted(count.keys()):
    print key, count[key]
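
To try this without the mbox file, here is a minimal sketch that runs the same counting logic on a few made-up lines. The sample_lines list and the count_hours helper are hypothetical names used only for illustration, and the print call is written so that it runs on both Python 2 and Python 3:

```python
# Hypothetical sample data standing in for lines of mbox-short.txt.
sample_lines = [
    "From a@example.com Sat Jan 5 09:14:16 2008\n",
    "Subject: not a From line\n",
    "From b@example.com Fri Jan 4 18:10:48 2008\n",
    "From c@example.com Fri Jan 4 09:02:01 2008\n",
]

def count_hours(lines):
    """Count messages per hour, keyed by the two-digit hour string."""
    count = dict()
    for text in lines:
        text = text.rstrip()
        if not text.startswith('From '):
            continue
        words = text.split()
        hour = words[5].split(':')[0]   # '09:14:16' -> '09'
        count[hour] = count.get(hour, 0) + 1
    return count

count = count_hours(sample_lines)

# Print once, after the loop has finished, in sorted key order.
for key in sorted(count.keys()):
    print("%s %d" % (key, count[key]))
```

Because the keys are zero-padded two-digit strings, sorting them as strings gives the same order as sorting them as numbers.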

Answer 2

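Since the datetime always has the same format, you can use a dumb approach:

your_string[-13:-11] # your hour

where your_string is the line you pasted. This works for any text that ends with the full datetime in that format, provided the trailing newline has been stripped first (the slice counts from the end of the string).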
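
A quick sketch of why the slice works and why the newline matters. The sample line below is made up for illustration; the time plus the year, "09:14:16 2015", is exactly 13 characters long, so counting 13 back from the end lands on the first digit of the hour:

```python
# Hypothetical line as it would be read from the file, newline included.
line = "From someone@example.com Sat Jan 5 09:14:16 2015\n"

stripped = line.rstrip()       # drop the trailing newline first
hour = stripped[-13:-11]       # first two characters of "09:14:16 2015"
print(hour)

# Without rstrip() the newline shifts everything by one character.
shifted = line[-13:-11]
print(repr(shifted))
```

This approach assumes a four-digit year and nothing after it on the line, so it is more fragile than splitting on whitespace.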

Answer 3

I think that if you really want that output, then instead of "print count" you need the following at the end (outside the loop):

for a in sorted(count.keys()):
    print a,count[a]

Answer 4

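Your problem is that you are printing the dictionary, and dictionaries in Python are not sorted (strictly speaking they do have an order, just not one sorted by key, so the point is moot).

You can work around this by sorting the dictionary keys before printing the results, as already suggested. Personally, I am not sure that is the best solution.

The reason is that you are working with numbers. More than that, you are working with numbers in the range [0, 23]. To me this literally screams "use a list!" :-)

So instead of dict(), try using: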

# count = dict()
count = [0] * 24

This creates a list of 24 items, with indexes from 0 to 23.

Now, what you get from parsing the string are also strings, so you need to convert them to numbers:

# count[hour] = count.get(hour, 0) + 1
count[int(hour)] += 1

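Note that an hour that cannot be converted to an integer, or that falls outside the range 0..23, would be accepted by a dict but fails with a pre-initialized list. That is actually a good thing: code that takes bad input and silently turns it into bad output is bad code. Of course, code that throws exceptions is not great code either, but it is a step in the right direction.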
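
A small sketch of that difference, using made-up bad values ("25" and "9x") and a hypothetical helper, try_list_increment, that reports which exception the list version raises:

```python
dict_count = {}
list_count = [0] * 24

def try_list_increment(counts, raw_hour):
    """Hypothetical helper: return 'ok' or the name of the exception raised."""
    try:
        counts[int(raw_hour)] += 1
        return "ok"
    except (ValueError, IndexError) as err:
        return type(err).__name__

# The dict happily stores both bad keys.
for bad in ("25", "9x"):
    dict_count[bad] = dict_count.get(bad, 0) + 1
print(sorted(dict_count.keys()))

# The list rejects them, each in its own way.
print(try_list_increment(list_count, "25"))   # index out of range
print(try_list_increment(list_count, "9x"))   # not a number
print(try_list_increment(list_count, "09"))   # a valid hour
```

One caveat: a negative number such as -1 is a valid list index in Python, so the list alone would not catch that case. It cannot arise from a two-digit hour field, but an explicit range check would be needed for untrusted input.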

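Of course, this raises another problem: if you print a dict, both its keys and its values are printed. If you print a list, only the values are printed. So we need to change the output code to: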

for hour, amount in enumerate(count):
    print hour, ':', amount

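The next point I would like to make about your code: are you absolutely sure that the e-mail addresses will never contain spaces? There is always a chance that your code will run into a line like this: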

From: "Bob Fisher" <bob@fishers.org> Sat Jan 5 09:14:16 2015

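Basically, the tail of your string seems to have a more regular and predictable format. That means it is more reliable to retrieve the time using slightly different syntax:

# line = text[5] 
line = text[-2] # take the 2nd element from the end of the list instead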
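
To make the difference concrete, here is a sketch using a made-up line with a quoted display name (a variant of the example above that starts with 'From ' so it would pass the startswith check):

```python
# Hypothetical line whose sender part contains a space.
line = 'From "Bob Fisher" <bob@fishers.org> Sat Jan 5 09:14:16 2015'
words = line.split()

from_front = words[5]    # position shifts when the sender has extra words
from_back = words[-2]    # the time is always second from the end

print(from_front)
print(from_back)
```

Counting from the front picks up the month here, while counting from the back still finds the time.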

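Using regular expressions would probably be more general, but that is a more advanced topic that I will skip here: if you know regular expressions, you will be able to do that easily, and if you don't, you will be better served by a proper introduction than by anything I could cobble together here.

Another nitpick: I notice that you never close your file handle. That is not a big problem here, because your program terminates anyway and any file handles still open are closed automatically. In a larger project it can cause problems. Your code may be called by other code, and if your code raises an exception that the caller handles or suppresses, the file handle stays open. Repeat that often enough and the program will exceed the operating system's limit on the number of open files.

So I suggest using a slightly different syntax to open the file: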

with open(name, 'r') as handle:
    for text in handle:
        # ...

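The advantage of this syntax is that 'with' closes the file handle properly no matter what happens in the code below it. Even if an exception occurs, the file is still closed properly.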
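
This can be checked with a short sketch. It writes a throwaway temporary file (so nothing here depends on mbox-short.txt), raises a simulated error inside the 'with' block, and then inspects the handle's closed attribute:

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, 'w') as out:
    out.write("From a@example.com Sat Jan 5 09:14:16 2008\n")

try:
    with open(path, 'r') as handle:
        for text in handle:
            # Simulated failure in the middle of processing.
            raise ValueError("simulated failure")
except ValueError:
    pass  # the caller swallows the exception

was_closed = handle.closed
print(was_closed)   # the handle was closed despite the exception

os.remove(path)     # clean up the temporary file
```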

So far, the code looks like this:

name = raw_input("Enter file:")
if not name: name = "mbox-short.txt" # cleaner check for empty string
count = [0] * 24 # use pre-initialized list instead of dict
with open(name, 'r') as handle: # use safer syntax to open files
    for text in handle:
        text = text.rstrip()
        if not text.startswith('From '): continue
        text = text.split()
        line = text[-2] # use 2nd item from the end, just to be safe
        time = line.split(':')
        hour = int(time[0]) # we treat hour as integer
        count[hour] += 1 # nicer looking
for hour, amount in enumerate(count):
    if amount: # Only print hours with non-zero counters
        print hour, ':', amount

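Now, there are ways to cut its size by at least half (possibly more), but I have tried to keep everything simple and true to the spirit of your original code.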

Answer 5

import re
import collections

name = raw_input("Enter file:")
if not name: name = "mbox-short.txt"

with open(name) as handle:
    hours = re.findall(r'^From .*(\d{2}):\d{2}:\d{2}', handle.read(), re.M)

count = sorted(collections.Counter(hours).items(), key=lambda x: int(x[0]))

for h, c in count:
    print h, c
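
The same pipeline can be tried on an in-memory string instead of a file. The sample text below is made up for illustration, and the print call is written to run on both Python 2 and Python 3. With re.M, '^' matches at the start of every line, and the greedy '.*' backtracks to the last hh:mm:ss pattern on the line, so only the hour is captured:

```python
import re
import collections

# Hypothetical stand-in for the contents of the mbox file.
sample = (
    "From a@example.com Sat Jan 5 09:14:16 2008\n"
    "From: a@example.com\n"
    "From b@example.com Fri Jan 4 18:10:48 2008\n"
    "From c@example.com Fri Jan 4 09:02:01 2008\n"
)

# One captured hour per matching 'From ' line; 'From:' headers do not match.
hours = re.findall(r'^From .*(\d{2}):\d{2}:\d{2}', sample, re.M)

# Count the hours and sort the (hour, count) pairs numerically by hour.
count = sorted(collections.Counter(hours).items(), key=lambda x: int(x[0]))

for h, c in count:
    print("%s %d" % (h, c))
```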
