Restructuring a string

Time: 2014-11-12 09:07:21

Tags: python string python-2.7

I need to restructure some strings, or extract the parts I need from them:

msg = ['Check-in  Jones  Room  403 at 2014-11-02 05 20 (recorded)',
   'Check-out Beis  Room  302 at 2014-11-03 05 20 (recorded)']


for each_guest in msg:
    each_guest = each_guest.replace('  ', ' ')
    action, name, room, number, at, date, time0, time1, recorded = each_guest.split(' ')

    print name, action, number, date, time0 + ':' + time1

The code above works, and the output is:

Jones Check-in 403 2014-11-02 05:20
Beis Check-out 302 2014-11-03 05:20

But as soon as the input changes, it breaks. For example, if the strings become:

msg = ['Check-in  Mr. Benny Jones  Room  403 at 2014-11-02 05 20 (recorded)',
   'Check-out Mr. Ken Beis  Room  302 at 2014-11-03 05 20 (recorded)']

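The number of spaces between words is not fixed. How can I extract the full name (including the title) and print it in the same format as the working sample?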

5 answers:

Answer 0: (score: 3)

>>> import re
>>> msg = ['Check-in  Mr. Benny Jones  Room  403 at 2014-11-02 05 20 (recorded)',
...    'Check-out Mr. Ken Beis  Room  302 at 2014-11-03 05 20 (recorded)']
>>> for each_guest in msg:
...       m = re.match(r"([-_a-zA-Z]+)\s+(.*?)\s+Room\s+(\d+)\s+at\s+(\d+-\d+-\d+)\s+(\d+)\s+(\d+)\s+\(recorded\)", each_guest)
...       action, name, number, date, time0, time1 = (m.group(i) for i in range(1, 7))
...       print(action, name, number, date, time0, time1)
('Check-in', 'Mr. Benny Jones', '403', '2014-11-02', '05', '20')
('Check-out', 'Mr. Ken Beis', '302', '2014-11-03', '05', '20')

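Answer 1: (score: 1)

Assuming you cannot do anything about the source data format (to get something easier to parse), regular expressions are your friend. Your message strings follow a fairly consistent pattern: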

"<action> <name> Room <room number> at <date> (recorded)"

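action is either "Check-in" or "Check-out", name is free text, room number is a sequence of digits, and date has the form "YYYY-MM-DD HH MM" (it seems safe to assume it always does). I won't write out the exact regex here, but it is fairly simple.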
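
For illustration, the pattern described above could be spelled out roughly as follows. This is only a sketch under the stated assumptions (two known actions, free-text name, "YYYY-MM-DD HH MM" date); the names GUEST_RE and format_guest and the group names are made up for this example. It uses print() with a single argument so it runs on both Python 2.7 and Python 3.

```python
import re

# Hypothetical pattern derived from the description above.
# re.VERBOSE lets the pattern be split over lines and commented.
GUEST_RE = re.compile(r"""
    ^(?P<action>Check-in|Check-out)\s+   # one of the two known actions
    (?P<name>.+?)\s+                     # free text, matched lazily
    Room\s+(?P<room>\d+)\s+              # literal Room, then the room digits
    at\s+(?P<date>\d{4}-\d{2}-\d{2})\s+  # YYYY-MM-DD
    (?P<hour>\d{2})\s+(?P<minute>\d{2})  # HH MM
    \s+\(recorded\)$                     # literal trailer
""", re.VERBOSE)


def format_guest(line):
    """Return 'name action room date HH:MM', or None if the line does not match."""
    m = GUEST_RE.match(line)
    if m is None:
        return None
    return "{name} {action} {room} {date} {hour}:{minute}".format(**m.groupdict())


print(format_guest("Check-in  Mr. Benny Jones  Room  403 at 2014-11-02 05 20 (recorded)"))
# Mr. Benny Jones Check-in 403 2014-11-02 05:20
```

Because every separator is \s+, the number of spaces between fields does not matter. Returning None for a non-matching line is just one possible choice; raising an exception would work as well.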

Answer 2: (score: 1)

This works as long as every other field has a fixed number of tokens (no extra spaces inside it):

msg = ['Check-in  Jones  Room  403 at 2014-11-02 05 20 (recorded)',
   'Check-out Beis  Room  302 at 2014-11-03 05 20 (recorded)',
   'Check-in  Mr. Benny Jones  Room  403 at 2014-11-02 05 20 (recorded)',
   'Check-out Mr. Ken Beis  Room  302 at 2014-11-03 05 20 (recorded)']


for each_guest in msg:
    # split() with no argument collapses any run of whitespace
    splits = each_guest.split()
    action = splits[0]
    # the last four tokens are: date, hour, minute, '(recorded)'
    date_and_time = splits[-4] + ' ' + ':'.join(splits[-3:-1])
    roomnumber = splits[-6]
    # everything between the action and 'Room' is the name
    name = ' '.join(splits[1:-7])

    print name, action, roomnumber, date_and_time

Prints:

Jones Check-in 403 2014-11-02 05:20
Beis Check-out 302 2014-11-03 05:20
Mr. Benny Jones Check-in 403 2014-11-02 05:20
Mr. Ken Beis Check-out 302 2014-11-03 05:20

Answer 3: (score: 1)

Since the strings have a fixed structure, you can use a regular expression (preferably with named groups):

import re

# lazy name group followed by \s+, so any number of spaces before 'Room' is accepted
r = re.compile(r'^(?P<check>[-\w]+)\s+(?P<name>.+?)\s+Room\s+(?P<room>\d+)\s+at\s+(?P<date>\S+)\s+(?P<hour>\d+)\s+(?P<min>\d+)')
msg = ['Check-in  Mr. Benny Jones  Room  403 at 2014-11-02 05 20 (recorded)',
       'Check-out Mr. Ken Beis  Room  302 at 2014-11-03 05 20 (recorded)',
       'Check-in  Jones  Room  403 at 2014-11-02 05 20 (recorded)',
       'Check-out Beis  Room  302 at 2014-11-03 05 20 (recorded)']

for m in msg:
    groups = r.search(m).groupdict()
    print '{name} {check} {room} {date} {hour}:{min}'.format(**groups)

Output:

Mr. Benny Jones Check-in 403 2014-11-02 05:20
Mr. Ken Beis Check-out 302 2014-11-03 05:20
Jones Check-in 403 2014-11-02 05:20
Beis Check-out 302 2014-11-03 05:20

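Answer 4: (score: 1)

I think you want index (here list.index, called on the result of split()). Something like this extracts the full name and the other information: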

# I am purposely not assigning names to the different indexes, so it is clearer how this works
for each in msg:
    each = each.split()
    print each[0], \
      ' '.join(each[1:each.index('Room')]), \
      each[each.index('Room')+1], \
      each[each.index('at')+1], \
      ':'.join(each[each.index('at')+2:each.index('(recorded)')]) # date time

Check-in Mr. Benny Jones 403 2014-11-02 05:20
Check-out Mr. Ken Beis 302 2014-11-03 05:20
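
If named fields are easier to follow, the same index-based idea can be wrapped in a small helper. This is a sketch only: parse_guest and its dictionary keys are hypothetical names, and it assumes the words 'Room', 'at' and '(recorded)' always appear as separate tokens in that order. It is written to run on both Python 2.7 and Python 3.

```python
def parse_guest(line):
    """Split a log line into named fields using the 'Room' and 'at' markers."""
    words = line.split()                    # collapses any run of whitespace
    room_pos = words.index('Room')          # raises ValueError if the marker is missing
    at_pos = words.index('at', room_pos)    # search only after 'Room'
    end_pos = words.index('(recorded)', at_pos)
    return {
        'action': words[0],
        'name': ' '.join(words[1:room_pos]),
        'room': words[room_pos + 1],
        'date': words[at_pos + 1],
        'time': ':'.join(words[at_pos + 2:end_pos]),
    }


guest = parse_guest('Check-out Mr. Ken Beis  Room  302 at 2014-11-03 05 20 (recorded)')
print('{name} {action} {room} {date} {time}'.format(**guest))
# Mr. Ken Beis Check-out 302 2014-11-03 05:20
```

Looking up each marker once and searching for 'at' only after 'Room' avoids repeated scans and keeps a name that happens to contain the word "at" from being mistaken for the marker.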