Question

I've started learning Python and have been working on a task that involves manipulating text data. Here is an example of a line I need to process:

From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008

I need to extract the hour from each line (09 in this example) and then find the most common hours at which the emails were sent.

Basically, what I need to do is build a for loop that splits each line on the colon

split(':')

and then splits it on whitespace:

split()

I've been trying for hours but can't seem to figure it out. Here is my code so far:

name = raw_input("Enter file:")
if len(name) < 1 : name = "mbox-short.txt"
handle = open(name)
counts = dict()
lst = list()
temp = list()
for line in handle:
    if not "From " in line: continue
    words = line.split(':')  
    for word in words:
        counts[word] = counts.get(word,0) + 1

for key, val in counts.items():
    lst.append( (val, key) )
lst.sort(reverse = True)

for val, key in lst:
    print key, val

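The code above performs only one split, and I have tried several ways of splitting the text a second time. I keep getting an attribute error saying "'list' object has no attribute 'split'". Any help would be greatly appreciated. Thanks again.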

Answer 1

First,

import re

Then replace

words = line.split(':')  
for word in words:
    counts[word] = counts.get(word,0) + 1

with

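# Find the HH:MM:SS timestamp; skip the line if it has none
match = re.search(r"[0-9]{2}:[0-9]{2}:[0-9]{2}", line)
if match is None: continue
# The part before the first colon is the hour
hour = match.group(0).split(':')[0]
counts[hour] = counts.get(hour, 0) + 1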
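
For anyone who wants to try this without the rest of the script, here is a minimal self-contained sketch of the same regex idea. It is written for Python 3 (the question's code is Python 2), and the names `TIME_RE` and `count_hours` are hypothetical, chosen only for this illustration. It assumes every timestamp has the form HH:MM:SS.

```python
import re

# Matches an HH:MM:SS timestamp such as 09:14:16 and captures the hour.
TIME_RE = re.compile(r"([0-9]{2}):[0-9]{2}:[0-9]{2}")


def count_hours(lines):
    """Count how often each hour appears in lines that contain 'From '."""
    counts = {}
    for line in lines:
        if "From " not in line:
            continue
        match = TIME_RE.search(line)
        if match is None:
            continue  # no timestamp on this line
        hour = match.group(1)
        counts[hour] = counts.get(hour, 0) + 1
    # Same ordering as the question's code: highest count first.
    return sorted(((n, hour) for hour, n in counts.items()), reverse=True)


sample = [
    "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008",
    "From stephen.marquard@uct.ac.za Sat Jan  5 12:14:16 2008",
    "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008",
]
for n, hour in count_hours(sample):
    print(hour, n)
```

Capturing the hour in a regex group avoids the second split entirely, which is one way around the "'list' object has no attribute 'split'" error.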

Input:

From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan  5 12:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan  5 15:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan  5 12:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan  5 13:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan  5 12:14:16 2008

Output:

Answer 2

Using the same test file as Marcel Jacques Machado:

>>> from collections import Counter
>>> Counter(line.split(' ')[-2].split(':')[0] for line in open('input')).items()
[('12', 3), ('09', 4), ('15', 1), ('13', 1)]

This shows that 09 occurs 4 times while 13 occurs only once.
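
The chained expression may be easier to follow when broken into steps. The sketch below uses the sample line from the question; the variable names are only illustrative. Note that each `split` is called on a string, never on the list returned by a previous `split`, which is what triggers the attribute error mentioned in the question.

```python
line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n"

fields = line.split(' ')          # a list of strings
timestamp = fields[-2]            # '09:14:16', the second-to-last field
hour = timestamp.split(':')[0]    # '09'

# fields.split(':') would raise AttributeError, because fields is a list;
# split has to be called on one of its string elements.
print(hour)
```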

If we want nicer output, we can add some formatting. This shows the hours and their counts from most common to least common:

>>> print('\n'.join('{} {}'.format(hh, n) for hh,n in Counter(line.split(' ')[-2].split(':')[0] for line in open('input')).most_common()))
09 4
12 3
15 1
13 1
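
The one-liners above assume that every line in the file is a "From " line. On a real mailbox file, a blank line or a line with a single word would make the `[-2]` index raise an IndexError. A more defensive sketch, assuming Python 3 and assuming every line that starts with "From " ends in "HH:MM:SS YYYY" as in the sample, could look like this (the function name `most_common_hours` is hypothetical):

```python
from collections import Counter


def most_common_hours(path):
    """Return (hour, count) pairs, most frequent hour first."""
    counts = Counter()
    with open(path) as handle:          # the file is closed when the block ends
        for line in handle:
            if not line.startswith("From "):
                continue                # skip other headers, body text, blank lines
            fields = line.split()
            if len(fields) < 2:
                continue
            # Assumption: the second-to-last field is the HH:MM:SS timestamp.
            counts[fields[-2].split(':')[0]] += 1
    return counts.most_common()
```

Each returned pair can then be printed with `'{} {}'.format(hh, n)` as in the formatted example above.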

Selecting text using multiple splits
