Question

I need to restructure a string, or extract the parts I need from it.

msg = ['Check-in  Jones  Room  403 at 2014-11-02 05 20 (recorded)',
   'Check-out Beis  Room  302 at 2014-11-03 05 20 (recorded)']


for each_guest in msg:
    each_guest = each_guest.replace('  ', ' ')
    action, name, room, number, at, date, time0, time1, recorded = each_guest.split(' ')

    print name, action, number, date, time0 + ':' + time1

The code above works fine and outputs:

Jones Check-in 403 2014-11-02 05:20
Beis Check-out 302 2014-11-03 05:20

But it breaks as soon as the input changes. For example, when the strings become:

msg = ['Check-in  Mr. Benny Jones  Room  403 at 2014-11-02 05 20 (recorded)',
   'Check-out Mr. Ken Beis  Room  302 at 2014-11-03 05 20 (recorded)']

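The number of spaces between words is not fixed. How can I extract the full name (including the title) and display it the same way as in the working sample?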

Answer 1

>>> import re
>>> msg = ['Check-in  Mr. Benny Jones  Room  403 at 2014-11-02 05 20 (recorded)',
...    'Check-out Mr. Ken Beis  Room  302 at 2014-11-03 05 20 (recorded)']
>>> for each_guest in msg:
...       m = re.match(r"([-_a-zA-Z]+)\s+(.*?)\s+Room\s+(\d+)\s+at\s+(\d+-\d+-\d+)\s+(\d+)\s+(\d+)\s+\(recorded\)", each_guest)
...       action, name, number, date, time0, time1 = (m.group(i) for i in range(1, 7))
...       print(action, name, number, date, time0, time1)
...
('Check-in', 'Mr. Benny Jones', '403', '2014-11-02', '05', '20')
('Check-out', 'Mr. Ken Beis', '302', '2014-11-03', '05', '20')
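
Note that re.match returns None when a line does not fit the pattern, so the loop above would fail with an AttributeError on an unexpected line. A minimal sketch of a guarded version follows. The names PATTERN and parse_guest are hypothetical (they are not part of the answer), and the sketch also joins the two time fields with a colon, as the question asks.

```python
import re

# Same pattern as in the answer above, split over two raw strings for readability.
PATTERN = re.compile(
    r"([-_a-zA-Z]+)\s+(.*?)\s+Room\s+(\d+)\s+at\s+"
    r"(\d+-\d+-\d+)\s+(\d+)\s+(\d+)\s+\(recorded\)"
)


def parse_guest(line):
    """Return (action, name, number, date, time), or None if the line does not match."""
    m = PATTERN.match(line)
    if m is None:
        return None
    action, name, number, date, time0, time1 = m.groups()
    return action, name, number, date, time0 + ":" + time1


msg = ['Check-in  Mr. Benny Jones  Room  403 at 2014-11-02 05 20 (recorded)',
       'Check-out Mr. Ken Beis  Room  302 at 2014-11-03 05 20 (recorded)']

for each_guest in msg:
    parsed = parse_guest(each_guest)
    if parsed is None:
        # Hypothetical handling: report the line instead of crashing.
        print("skipped: " + each_guest)
    else:
        print(" ".join(parsed))
```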

Answer 2

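Assuming you can't do anything about the source data format (so that you could get something easier to parse), regular expressions are your friend. Your message strings follow a fairly consistent pattern: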

"<action> <name> Room <room number> at <date> (recorded)"

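action is one of "Check-in" or "Check-out", name is free text, room number is a run of digits, and date is "YYYY-MM-DD HH MM" (assuming, fairly safely, that it always looks like this). I won't write out the exact regular expression here, but it is fairly simple.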
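
One possible way to spell that template out as a regular expression is sketched below. It is only an illustration under the assumptions listed in this answer: the action is exactly "Check-in" or "Check-out", the room number is all digits, and the timestamp is always "YYYY-MM-DD HH MM". The names MESSAGE_RE and parse_message are hypothetical. The re.VERBOSE flag lets each field carry its own comment, and datetime.strptime turns the timestamp into a real datetime object.

```python
import re
from datetime import datetime

MESSAGE_RE = re.compile(
    r"""
    ^(?P<action>Check-in|Check-out)\s+   # one of the two known actions
    (?P<name>.+?)\s+                     # free text, matched as short as possible
    Room\s+(?P<room>\d+)\s+              # room number: a run of digits
    at\s+(?P<date>\d{4}-\d{2}-\d{2}\s+\d{2}\s+\d{2})\s+   # YYYY-MM-DD HH MM
    \(recorded\)$
    """,
    re.VERBOSE,
)


def parse_message(line):
    """Return (action, name, room, when) for one message, or raise ValueError."""
    m = MESSAGE_RE.match(line.strip())
    if m is None:
        raise ValueError("unrecognised message: %r" % line)
    # Collapse runs of whitespace so the timestamp has single separators.
    stamp = " ".join(m.group("date").split())
    when = datetime.strptime(stamp, "%Y-%m-%d %H %M")
    return m.group("action"), m.group("name"), int(m.group("room")), when


msg = ['Check-in  Mr. Benny Jones  Room  403 at 2014-11-02 05 20 (recorded)',
       'Check-out Mr. Ken Beis  Room  302 at 2014-11-03 05 20 (recorded)']

for line in msg:
    action, name, room, when = parse_message(line)
    print("%s %s %d %s" % (name, action, room, when.strftime("%Y-%m-%d %H:%M")))
```

Because the name group is non-greedy, it stops at the first whitespace followed by the word Room, so a guest name that itself contains " Room " would be cut short.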

Answer 3

This works as long as all the other fields have a fixed length (no extra whitespace inside them):

msg = ['Check-in  Jones  Room  403 at 2014-11-02 05 20 (recorded)',
   'Check-out Beis  Room  302 at 2014-11-03 05 20 (recorded)',
   'Check-in  Mr. Benny Jones  Room  403 at 2014-11-02 05 20 (recorded)',
   'Check-out Mr. Ken Beis  Room  302 at 2014-11-03 05 20 (recorded)']


for each_guest in msg:
    each_guest = each_guest.replace('  ', ' ')
    splits = each_guest.split()
    action = splits[0]
    date_and_time = ' '.join(splits[-4:-2])
    roomnumber = splits[-6]
    name = ' '.join(splits[1:-7])

    print name, action, roomnumber, date_and_time

This prints:

Jones Check-in 403 2014-11-02 05
Beis Check-out 302 2014-11-03 05
Mr. Benny Jones Check-in 403 2014-11-02 05
Mr. Ken Beis Check-out 302 2014-11-03 05
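
The slice splits[-4:-2] picks up the date and the hour but stops before the minutes, which is why the output above ends in "05" rather than "05:20". A small variant that also keeps the minutes could look like the sketch below. The function name split_fields is hypothetical, and the approach still assumes that only the name varies in length.

```python
msg = ['Check-in  Jones  Room  403 at 2014-11-02 05 20 (recorded)',
       'Check-out Beis  Room  302 at 2014-11-03 05 20 (recorded)',
       'Check-in  Mr. Benny Jones  Room  403 at 2014-11-02 05 20 (recorded)',
       'Check-out Mr. Ken Beis  Room  302 at 2014-11-03 05 20 (recorded)']


def split_fields(line):
    # split() with no argument already collapses any run of whitespace.
    splits = line.split()
    action = splits[0]
    # Counting from the end: '(recorded)' is -1, minutes -2, hours -3,
    # the date -4, 'at' -5, the room number -6 and the literal 'Room' -7.
    date = splits[-4]
    time = splits[-3] + ':' + splits[-2]
    roomnumber = splits[-6]
    # Everything between the action and 'Room' belongs to the name.
    name = ' '.join(splits[1:-7])
    return name, action, roomnumber, date, time


for each_guest in msg:
    print(' '.join(split_fields(each_guest)))
```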

Answer 4

Since the strings have a fixed structure, you can use a regular expression (preferably with named groups):

import re

r = re.compile(r'^(?P<check>[-\w]+)\s+(?P<name>.*(?=\s\sRoom))\s+Room\s+(?P<room>\d+)\s+at\s+(?P<date>\S+)\s(?P<hour>\d+)\s(?P<min>\d+)')
msg = ['Check-in  Mr. Benny Jones  Room  403 at 2014-11-02 05 20 (recorded)',
       'Check-out Mr. Ken Beis  Room  302 at 2014-11-03 05 20 (recorded)',
       'Check-in  Jones  Room  403 at 2014-11-02 05 20 (recorded)',
       'Check-out Beis  Room  302 at 2014-11-03 05 20 (recorded)']

for m in msg:
    groups = r.search(m).groupdict()
    print '{name} {check} {room} {date} {hour}:{min}'.format(**groups)

Output:

Mr. Benny Jones Check-in 403 2014-11-02 05:20
Mr. Ken Beis Check-out 302 2014-11-03 05:20
Jones Check-in 403 2014-11-02 05:20
Beis Check-out 302 2014-11-03 05:20
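
One caveat: the lookahead (?=\s\sRoom) requires exactly two whitespace characters before Room, so a line with a single space there would not match and r.search(m) would return None. Since the question says the spacing is not fixed, a more tolerant variant is sketched below. It swaps the lookahead for a non-greedy name group, which is a change to the pattern above rather than the answer's own method, and the names relaxed and format_guest are hypothetical.

```python
import re

# Non-greedy name group: stops at the first run of whitespace followed by 'Room'.
relaxed = re.compile(
    r'^(?P<check>[-\w]+)\s+(?P<name>.+?)\s+Room\s+(?P<room>\d+)'
    r'\s+at\s+(?P<date>\S+)\s+(?P<hour>\d+)\s+(?P<min>\d+)'
)


def format_guest(line):
    """Return the formatted record, or None if the line does not match."""
    m = relaxed.search(line)
    if m is None:
        return None
    return '{name} {check} {room} {date} {hour}:{min}'.format(**m.groupdict())


samples = ['Check-in  Mr. Benny Jones  Room  403 at 2014-11-02 05 20 (recorded)',
           'Check-out Mr. Ken Beis Room 302 at 2014-11-03   05 20 (recorded)']

for line in samples:
    print(format_guest(line))
```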

Answer 5

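I think you want index (called here on the list returned by split()). Something like this extracts the full name and the other information: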

# I'm purposely not assigning names to the different indexes so it's clearer how it works
for each in msg:
    each = each.split()
    print each[0], \
      ' '.join(each[1:each.index('Room')]), \
      each[each.index('Room')+1], \
      each[each.index('at')+1], \
      ':'.join(each[each.index('at')+2:each.index('(recorded)')]) # date time

Check-in Mr. Benny Jones 403 2014-11-02 05:20
Check-out Mr. Ken Beis 302 2014-11-03 05:20
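
For comparison, the same idea with each index looked up once and given a name might look like the sketch below. The function name extract and the variable names are hypothetical. Keep in mind that list.index returns the first occurrence and raises ValueError when the word is missing, so this assumes that 'Room', 'at' and '(recorded)' each appear as separate words and that the name itself does not contain 'Room' or 'at'.

```python
def extract(line):
    words = line.split()
    room_pos = words.index('Room')       # position of the literal 'Room'
    at_pos = words.index('at')           # position of the literal 'at'
    end_pos = words.index('(recorded)')  # position of the trailing marker
    action = words[0]
    name = ' '.join(words[1:room_pos])   # everything between the action and 'Room'
    number = words[room_pos + 1]
    date = words[at_pos + 1]
    time = ':'.join(words[at_pos + 2:end_pos])  # hour and minute joined by a colon
    return action, name, number, date, time


msg = ['Check-in  Mr. Benny Jones  Room  403 at 2014-11-02 05 20 (recorded)',
       'Check-out Mr. Ken Beis  Room  302 at 2014-11-03 05 20 (recorded)']

for each in msg:
    print(' '.join(extract(each)))
```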

Restructuring a string

5 answers: