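Selecting text using multiple splits

Time: 2016-09-26 00:21:29

Tags: python split

I have started learning Python, and I am stuck on a task that involves manipulating text data. Here is an example of the lines I need to process: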

From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008

I need to extract the hour from each such line (09 in this example) and then find the most common hour at which the emails were sent.

Basically, what I need to do is build a for loop that splits each line of text on the colon:

split(':')

and then splits on whitespace:

split()

I have been trying for several hours but cannot figure it out. This is what my code looks like so far:

name = raw_input("Enter file:")
if len(name) < 1 : name = "mbox-short.txt"
handle = open(name)
counts = dict()
lst = list()
temp = list()
for line in handle:
    if not "From " in line: continue
    words = line.split(':')  
    for word in words:
        counts[word] = counts.get(word,0) + 1

for key, val in counts.items():
    lst.append( (val, key) )
lst.sort(reverse = True)

for val, key in lst:
    print key, val

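The code above performs only one split. I have tried several ways of splitting the text a second time, but I keep getting an attribute error saying "'list' object has no attribute 'split'". Any help would be greatly appreciated. Thanks again.

2 answers:

Answer 0: (score: 1)

First,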

import re

then replace

words = line.split(':')  
for word in words:
    counts[word] = counts.get(word,0) + 1

with

line = re.search("[0-9]{2}:[0-9]{2}:[0-9]{2}", line).group(0)
words = line.split(':')
hour = words[0]
counts[hour] = counts.get(hour, 0) + 1
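
As a hedged sketch, the replacement above can be put together in one self-contained piece. The helper name `count_hours` and the sample lines are made up for illustration, and the code is written for Python 3 (the question's code is Python 2). It also skips lines where no time is found, because `re.search` returns `None` in that case and calling `.group(0)` on it would fail.

```python
import re

# Matches a time such as 09:14:16 and captures the hour part.
TIME_PATTERN = re.compile(r"([0-9]{2}):[0-9]{2}:[0-9]{2}")


def count_hours(lines):
    """Count how often each hour appears in lines containing 'From '."""
    counts = {}
    for line in lines:
        if "From " not in line:
            continue
        match = TIME_PATTERN.search(line)
        if match is None:
            # No time on this line; skip it instead of failing on None.
            continue
        hour = match.group(1)
        counts[hour] = counts.get(hour, 0) + 1
    return counts


# Hypothetical sample data in the same layout as the question's example.
sample = [
    "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n",
    "From stephen.marquard@uct.ac.za Sat Jan  5 12:14:16 2008\n",
    "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n",
    "From a line without any timestamp\n",
    "X-Note: not a From line\n",
]
hour_counts = count_hours(sample)
print(hour_counts)
```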

Input:

From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan  5 12:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan  5 15:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan  5 12:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan  5 13:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan  5 12:14:16 2008

Output:

09 4
12 3
15 1
13 1

Answer 1: (score: 1)

Using the same test file as Marcel Jacques Machado:

>>> from collections import Counter
>>> Counter(line.split(' ')[-2].split(':')[0] for line in open('input')).items()
[('12', 3), ('09', 4), ('15', 1), ('13', 1)]

This shows that 09 occurred 4 times, while 13 occurred only once.
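
To see why the chained call works, and why calling `split` on the result of an earlier `split` produces the attribute error from the question, it may help to look at each intermediate value. This is only an illustrative sketch using the sample line from the question; the variable names are invented here.

```python
line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n"

fields = line.split(' ')       # a list of strings, not a string
timestamp = fields[-2]         # pick one element first: '09:14:16'
parts = timestamp.split(':')   # now split that string: ['09', '14', '16']
hour = parts[0]                # '09'

# split() is a string method, so fields.split(':') would raise
# AttributeError: 'list' object has no attribute 'split'.
print(hour)
```

The key point is that the second split has to be applied to a single element of the list returned by the first split, not to the list itself.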

If we want prettier output, we can do some formatting. This prints the hours and their counts, from most common to least common:

>>> print('\n'.join('{} {}'.format(hh, n) for hh,n in Counter(line.split(' ')[-2].split(':')[0] for line in open('input')).most_common()))
09 4
12 3
15 1
13 1
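
The one-liner never closes the file explicitly. A slightly longer sketch, assuming the same whitespace-separated line layout, uses a `with` block so the file is closed as soon as counting is done. The function name `hours_by_frequency` and the temporary demo file are made up for illustration, and the `startswith` filter is an added assumption that is not part of the answer above.

```python
import os
import tempfile
from collections import Counter


def hours_by_frequency(path):
    """Return (hour, count) pairs, most common hour first."""
    with open(path) as handle:
        counter = Counter(
            line.split()[-2].split(':')[0]
            for line in handle
            if line.startswith("From ")  # assumption: skip other lines
        )
    return counter.most_common()


# Demo on a small temporary file with hypothetical contents.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n")
    tmp.write("From stephen.marquard@uct.ac.za Sat Jan  5 12:14:16 2008\n")
    tmp.write("From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n")
    demo_path = tmp.name

result = hours_by_frequency(demo_path)
os.remove(demo_path)

for hh, n in result:
    print('{} {}'.format(hh, n))
```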