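Tokenizing a string with an unequal number of spaces between fields

Time: 2011-01-01 23:54:19

Tags: python split tokenize

I am trying to tokenize the entries in a file. Because the number of spaces between fields is not uniform, I cannot use the line.split(" ") option. Here are a few lines copied from the file: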

"08-09-2010 21:21:46      00:22:7f:a6:9b:69                                 -79"
"08-09-2010 21:21:46      04:4f:aa:b4:49:49                                 -79"
"08-09-2010 21:21:46      04:4f:aa:31:4e:59   tikona 18002090044            -83"
"08-09-2010 21:21:46      00:22:7f:26:9b:69   tikona 18002090044            -74"
"08-09-2010 21:21:46      04:4f:aa:34:0d:c9   tikona 18002090044            -82"
"08-09-2010 21:21:46      04:4f:aa:71:4e:59                                 -85"
"08-09-2010 21:21:46      04:4f:aa:34:21:89   tikona 18002090044            -75"
"08-09-2010 21:21:46      04:4f:aa:34:49:49   tikona 18002090044            -77"
"08-09-2010 21:21:46      04:4f:aa:74:0d:c9                                 -85"
"08-09-2010 21:22:47      18 APs were seen
"

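I need to access the first column (a datetime object), the second column (00:22...), and the last column (-79, etc.). I can easily access the first and second columns, but not the last one. When I do info = line.split(" "), I cannot determine the token index, because the third column may or may not have an entry.

How can I access the fourth column? Is there a way to use something like info[i].contains(" -")?

4 Answers:

Answer 0: (score: 7)

The columns look fixed-width, in which case you can use string slicing, followed by .strip() to remove any trailing whitespace: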

>>> for line in data.split('\n'):
...     print (line[1:25].strip(), line[26:45].strip(), line[46:69].strip(), line[70:-1].strip())
... 
('08-09-2010 21:21:46', '00:22:7f:a6:9b:69', '', '-79')
('08-09-2010 21:21:46', '04:4f:aa:b4:49:49', '', '-79')
('08-09-2010 21:21:46', '04:4f:aa:31:4e:59', 'tikona 18002090044', '-83')
('08-09-2010 21:21:46', '00:22:7f:26:9b:69', 'tikona 18002090044', '-74')
('08-09-2010 21:21:46', '04:4f:aa:34:0d:c9', 'tikona 18002090044', '-82')
('08-09-2010 21:21:46', '04:4f:aa:71:4e:59', '', '-85')
('08-09-2010 21:21:46', '04:4f:aa:34:21:89', 'tikona 18002090044', '-75')
('08-09-2010 21:21:46', '04:4f:aa:34:49:49', 'tikona 18002090044', '-77')
('08-09-2010 21:21:46', '04:4f:aa:74:0d:c9', '', '-85')
('08-09-2010 21:22:47', '18 APs were seen', '', '')
('', '', '', '')

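The ('', '', '', '') comes from the final input line, which contains only ".

If the columns are not fixed-width, you can still use .split() and get the last column with index -1. You should be careful with .split(), though, because doing it "properly" gets a bit messy. I suggest using a double space as the delimiter to handle the 18 APs were seen case, but note that this changes the index of the second column.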

>>> for line in data.split('\n'):
...     fields = line.split('  ')
...     print (fields[0], fields[3], fields[-1])
... 
('"08-09-2010 21:21:46', '00:22:7f:a6:9b:69', ' -79"')
('"08-09-2010 21:21:46', '04:4f:aa:b4:49:49', ' -79"')
('"08-09-2010 21:21:46', '04:4f:aa:31:4e:59', '-83"')
('"08-09-2010 21:21:46', '00:22:7f:26:9b:69', '-74"')
('"08-09-2010 21:21:46', '04:4f:aa:34:0d:c9', '-82"')
('"08-09-2010 21:21:46', '04:4f:aa:71:4e:59', ' -85"')
('"08-09-2010 21:21:46', '04:4f:aa:34:21:89', '-75"')
('"08-09-2010 21:21:46', '04:4f:aa:34:49:49', '-77"')
('"08-09-2010 21:21:46', '04:4f:aa:74:0d:c9', ' -85"')
('"08-09-2010 21:22:47', '18 APs were seen', '18 APs were seen')
Traceback (most recent call last):
  File "<input>", line 3, in <module>
IndexError: list index out of range

The IndexError is caused by your last input line. If that is the real input, you should catch this error.
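As a minimal sketch of the .split() plus index -1 idea, the following splits on any run of whitespace and guards against the short and non-numeric lines instead of letting them raise. The helper name last_column and the rule "a data row has at least four tokens and ends in an integer" are assumptions made for illustration, not something the file format guarantees.

```python
def last_column(line):
    """Return the last whitespace-separated token as an int, or None."""
    # Remove surrounding whitespace and the literal double quotes seen in the sample.
    tokens = line.strip().strip('"').split()
    if len(tokens) < 4:
        # Assumed minimum for a data row: date, time, MAC address, signal value.
        return None
    try:
        return int(tokens[-1])
    except ValueError:
        # e.g. the "18 APs were seen" summary line
        return None


sample = [
    '"08-09-2010 21:21:46      00:22:7f:a6:9b:69                                 -79"',
    '"08-09-2010 21:21:46      04:4f:aa:31:4e:59   tikona 18002090044            -83"',
    '"08-09-2010 21:22:47      18 APs were seen',
    '"',
]
print([last_column(line) for line in sample])
```

Calling .split() with no argument treats consecutive spaces as one separator, so the position of the last token no longer depends on whether the third column is filled in.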

Answer 1: (score: 1)

You can split it using a regular expression:

#!/usr/bin/env python3

import re

# One named group per field; host and host_id are optional.
mac_data_re = re.compile(
    r'^(?P<date>[\d-]+)\s+'
    r'(?P<time>[\d:]+)\s+'
    r'(?P<mac>[\da-f:]+)\s+'
    r'(?P<host>\w+)?\s+'
    r'(?P<host_id>\d+)?\s+'
    r'(?P<final_number>-?\d+)$')

with open('list') as f:
    for line in (l.strip() for l in f):
        match = mac_data_re.match(line)
        if match:
            print("date={date}, time={time}, mac={mac}, host={host}, host_id={host_id} final_number={final_number}".format(**match.groupdict()))
        else:
            print("Line not matched: '%s'" % line)

Here is the output:

 aid@bullet:~/tmp$ ./parse_list.py 
date=08-09-2010, time=21:21:46, mac=00:22:7f:a6:9b:69, host=None, host_id=None final_number=-79
date=08-09-2010, time=21:21:46, mac=04:4f:aa:b4:49:49, host=None, host_id=None final_number=-79
date=08-09-2010, time=21:21:46, mac=04:4f:aa:31:4e:59, host=tikona, host_id=18002090044 final_number=-83
date=08-09-2010, time=21:21:46, mac=00:22:7f:26:9b:69, host=tikona, host_id=18002090044 final_number=-74
date=08-09-2010, time=21:21:46, mac=04:4f:aa:34:0d:c9, host=tikona, host_id=18002090044 final_number=-82
date=08-09-2010, time=21:21:46, mac=04:4f:aa:71:4e:59, host=None, host_id=None final_number=-85
date=08-09-2010, time=21:21:46, mac=04:4f:aa:34:21:89, host=tikona, host_id=18002090044 final_number=-75
date=08-09-2010, time=21:21:46, mac=04:4f:aa:34:49:49, host=tikona, host_id=18002090044 final_number=-77
date=08-09-2010, time=21:21:46, mac=04:4f:aa:74:0d:c9, host=None, host_id=None final_number=-85
Line not matched: '08-09-2010 21:22:47      18 APs were seen'
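Since the question asks for the first column as a datetime object, here is a rough sketch of how regex groups could be converted to typed values. The names ROW_RE, DATE_FORMAT and parse_row are hypothetical, the pattern is a simplified variant rather than the one above, and the day-month-year order is an assumption, because the sample date 08-09-2010 could be read either way.

```python
import re
from datetime import datetime

# Assumption: dates are day-month-year. Use "%m-%d-%Y %H:%M:%S" if they are month-first.
DATE_FORMAT = "%d-%m-%Y %H:%M:%S"

# Simplified variant: the optional name and number are treated as one optional unit.
ROW_RE = re.compile(
    r'^(?P<date>[\d-]+)\s+'
    r'(?P<time>[\d:]+)\s+'
    r'(?P<mac>[\da-f:]+)\s+'
    r'(?:(?P<host>\w+)\s+(?P<host_id>\d+)\s+)?'
    r'(?P<signal>-?\d+)$')


def parse_row(line):
    """Return (datetime, mac, signal) for a data row, or None if it does not match."""
    match = ROW_RE.match(line.strip().strip('"'))
    if match is None:
        return None
    when = datetime.strptime(match.group('date') + ' ' + match.group('time'), DATE_FORMAT)
    return when, match.group('mac'), int(match.group('signal'))


row = parse_row('08-09-2010 21:21:46      04:4f:aa:31:4e:59   tikona 18002090044            -83')
print(row)
print(parse_row('08-09-2010 21:22:47      18 APs were seen'))
```

Lines that do not fit the pattern, such as the summary line, come back as None, so the caller can decide whether to skip or log them.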

Answer 2: (score: 0)

You can use rsplit to get the last value, e.g. line.rsplit(" ", 1)
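A small sketch of the rsplit approach on one sample row follows. It passes None as the separator instead of " ", which is a deliberate swap so that runs of whitespace count as a single separator; the variable names are illustrative only.

```python
line = "08-09-2010 21:21:46      04:4f:aa:31:4e:59   tikona 18002090044            -83"

# maxsplit=1 cuts off only the last token; sep=None splits on runs of whitespace.
rest, last = line.rsplit(None, 1)

# The first three whitespace-separated tokens are date, time and MAC address.
stamp_date, stamp_time, mac = rest.split()[:3]
print(mac, int(last))
```

Note that the two-name unpacking fails with a ValueError on a line that holds a single token, so real input would need a length check first.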

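Answer 3: (score: 0)

Do you have control over the code that writes this file? If so, you could change it to separate fields with tabs and then split on the tab character. That would keep the field separation consistent.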
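To illustrate, here is a minimal sketch of reading such a file, assuming the writer were changed to emit tabs. The tab-separated text below is a hypothetical rewrite of two sample rows, and the standard csv module is just one possible reader.

```python
import csv
import io

# Hypothetical tab-separated version of two rows; the empty third field stays empty.
tsv_text = (
    "08-09-2010 21:21:46\t00:22:7f:a6:9b:69\t\t-79\n"
    "08-09-2010 21:21:46\t04:4f:aa:31:4e:59\ttikona 18002090044\t-83\n"
)

# io.StringIO stands in for an open file object.
rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
for stamp, mac, name, signal in rows:
    print(stamp, mac, repr(name), int(signal))
```

Because an empty field still occupies its own slot between two tabs, every row has exactly four columns and the signal value is always at index 3.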