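I have started learning Python and have been working on tasks that involve manipulating text data. Here is an example of the lines I need to process: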
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
I need to extract the hour from each line (09 in this example) and then find the most common hours at which emails were sent.
Basically, what I need to do is build a for loop that splits each line on the colon with split(':') and then splits on whitespace with split().
I have been at this for a few hours and can't seem to figure it out. Here is my code so far:
name = raw_input("Enter file:")
if len(name) < 1 : name = "mbox-short.txt"
handle = open(name)
counts = dict()
lst = list()
temp = list()
for line in handle:
    if not "From " in line: continue
    words = line.split(':')
    for word in words:
        counts[word] = counts.get(word,0) + 1
for key, val in counts.items():
    lst.append( (val, key) )
lst.sort(reverse = True)
for val, key in lst:
    print key, val
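The code above only performs one split, and I have been trying various ways to split the text a second time. I keep getting an AttributeError saying "list object has no attribute split". Any help is greatly appreciated. Thanks again.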
Answer 0: (score: 1)
First,
import re
then replace
    words = line.split(':')
    for word in words:
        counts[word] = counts.get(word,0) + 1
with
    line = re.search("[0-9]{2}:[0-9]{2}:[0-9]{2}", line).group(0)
    words = line.split(':')
    hour = words[0]
    counts[hour] = counts.get(hour, 0) + 1
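One caveat with the replacement above: re.search returns None when a line has no HH:MM:SS timestamp, and calling .group(0) on None raises an AttributeError. The following is a minimal sketch of the same counting logic with a guard for that case. The function name count_hours and the use of startswith (instead of the original `"From " in line` test) are my own assumptions, not part of the answer.

```python
import re

# Same pattern as in the answer: two digits, colon, two digits, colon, two digits.
TIME_PATTERN = re.compile(r"[0-9]{2}:[0-9]{2}:[0-9]{2}")


def count_hours(lines):
    """Count how often each hour appears in lines that start with 'From '.

    Hypothetical helper for illustration; it takes any iterable of strings,
    such as an open file handle.
    """
    counts = {}
    for line in lines:
        # Assumption: only lines beginning with 'From ' carry the timestamp.
        if not line.startswith("From "):
            continue
        match = TIME_PATTERN.search(line)
        if match is None:
            # No timestamp on this line, so skip it instead of crashing.
            continue
        hour = match.group(0).split(":")[0]
        counts[hour] = counts.get(hour, 0) + 1
    return counts


sample_lines = [
    "From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008\n",
    "From stephen.marquard@uct.ac.za Sat Jan 5 12:14:16 2008\n",
    "From a line without any timestamp\n",
    "Subject: this line is ignored\n",
]
print(sorted(count_hours(sample_lines).items()))
```

Since the function accepts any iterable of strings, it could be called as count_hours(open(name)) in place of the loop body in the question.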
Input:
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan 5 12:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan 5 15:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan 5 12:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan 5 13:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan 5 12:14:16 2008
Output:
09 4
12 3
15 1
13 1
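For completeness, the split-only approach described in the question can also work. The error there comes from the fact that split(':') returns a list, and a list has no split method; one element of that list has to be selected before splitting again. Here is a small sketch of that idea, where hour_from_line is a hypothetical helper name and the sample line is the one from the question:

```python
def hour_from_line(line):
    """Return the hour of a 'From ' line using only str.split.

    Hypothetical helper for illustration. It assumes the first colon in the
    line is the one inside the HH:MM:SS timestamp.
    """
    # Everything before the first colon ends with the hour,
    # e.g. 'From stephen.marquard@uct.ac.za Sat Jan 5 09'.
    before_first_colon = line.split(":")[0]
    # Split that string (not the list) on whitespace and take the last word.
    return before_first_colon.split()[-1]


sample = "From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008"

try:
    # This is the failing pattern: .split() is called on the list itself.
    sample.split(":").split()
except AttributeError as exc:
    print(exc)

print(hour_from_line(sample))
```

In the original loop, the line `hour = hour_from_line(line)` followed by the counts update would then take the place of the inner for loop.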
Answer 1: (score: 1)
Using the same test file as Marcel Jacques Machado:
>>> from collections import Counter
>>> Counter(line.split(' ')[-2].split(':')[0] for line in open('input')).items()
[('12', 3), ('09', 4), ('15', 1), ('13', 1)]
This shows that 09 occurred 4 times and 13 occurred only once.
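The one-liner packs several steps together, so here is a sketch that unpacks it. The helper name hour_of and the in-memory list of lines are assumptions made for illustration; the answer itself reads the lines from a file named input.

```python
from collections import Counter

# The nine test lines from the answer above, with the trailing newline
# they would have when read from a file.
times = ["09", "12", "09", "09", "15", "12", "09", "13", "12"]
lines = [
    "From stephen.marquard@uct.ac.za Sat Jan 5 %s:14:16 2008\n" % hh
    for hh in times
]


def hour_of(line):
    """Hypothetical helper: the same expression as the one-liner, in two steps."""
    # Splitting on single spaces leaves the year (plus newline) as the last
    # item, so the timestamp 'HH:MM:SS' is the second-to-last item.
    timestamp = line.split(" ")[-2]
    # The hour is the part of the timestamp before the first colon.
    return timestamp.split(":")[0]


# Counter tallies how many times each hour string is produced.
hour_counts = Counter(hour_of(line) for line in lines)
print(hour_counts["09"])
print(hour_counts["13"])
```

Counter behaves like a dictionary, so hour_counts["09"] looks up the tally for a single hour, and hours that never appear simply count as 0.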
If we want prettier output, we can add some formatting. This prints the hours and their counts from most common to least common:
>>> print('\n'.join('{} {}'.format(hh, n) for hh,n in Counter(line.split(' ')[-2].split(':')[0] for line in open('input')).most_common()))
09 4
12 3
15 1
13 1