I need to rebuild strings, or extract the content I need from them:
msg = ['Check-in Jones Room 403 at 2014-11-02 05 20 (recorded)',
       'Check-out Beis Room 302 at 2014-11-03 05 20 (recorded)']
for each_guest in msg:
    # collapse double spaces so that split(' ') yields exactly nine items
    each_guest = each_guest.replace('  ', ' ')
    action, name, room, number, at, date, time0, time1, recorded = each_guest.split(' ')
    print(name, action, number, date, time0 + ':' + time1)
The above runs fine and prints:
Jones Check-in 403 2014-11-02 05:20
Beis Check-out 302 2014-11-03 05:20
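But it breaks as soon as the input changes, because split(' ') then returns more than nine items and the unpacking raises a ValueError. For example, if the strings become:
msg = ['Check-in Mr. Benny Jones Room 403 at 2014-11-02 05 20 (recorded)',
       'Check-out Mr. Ken Beis Room 302 at 2014-11-03 05 20 (recorded)']
The number of spaces between words is not fixed. How can I extract the full name (including the title) and display it the same way as in the working example?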
Answer 0 (score: 3)
>>> import re
>>> msg = ['Check-in Mr. Benny Jones Room 403 at 2014-11-02 05 20 (recorded)',
...        'Check-out Mr. Ken Beis Room 302 at 2014-11-03 05 20 (recorded)']
>>> for each_guest in msg:
...     m = re.match(r"([-_a-zA-Z]+)\s+(.*?)\s+Room\s+(\d+)\s+at\s+(\d+-\d+-\d+)\s+(\d+)\s+(\d+)\s+\(recorded\)", each_guest)
...     action, name, number, date, time0, time1 = m.groups()
...     print((action, name, number, date, time0, time1))
...
('Check-in', 'Mr. Benny Jones', '403', '2014-11-02', '05', '20')
('Check-out', 'Mr. Ken Beis', '302', '2014-11-03', '05', '20')
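One thing the session above does not cover is a line that does not fit the pattern: re.match then returns None and m.groups() raises an AttributeError. Here is a minimal sketch of guarding against that, written for Python 3. The names PATTERN and parse_line and the two sample lines (one with extra spaces, one malformed) are made up for illustration.

```python
import re

# Same pattern as in the answer above, compiled once as a raw string.
PATTERN = re.compile(
    r"([-_a-zA-Z]+)\s+(.*?)\s+Room\s+(\d+)\s+at\s+"
    r"(\d+-\d+-\d+)\s+(\d+)\s+(\d+)\s+\(recorded\)"
)

def parse_line(line):
    """Return the six captured fields as a tuple, or None if the line does not fit."""
    m = PATTERN.match(line)
    return m.groups() if m else None

# Hypothetical input: irregular spacing in the first line, garbage in the second.
lines = [
    'Check-in   Mr. Benny Jones  Room 403 at 2014-11-02 05 20 (recorded)',
    'this line is not a check-in record',
]
results = [parse_line(line) for line in lines]
for line, fields in zip(lines, results):
    print(fields if fields else 'skipped: ' + line)
```

Whether to skip, log, or raise on a bad line depends on how much you trust the source data; skipping is just the simplest choice for a sketch.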
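Answer 1 (score: 1)
Assuming you can't do anything about the source data format (to get something easier to parse), regular expressions are your friend. Your message strings follow a fairly consistent pattern:
"<action> <name> Room <room number> at <date> (recorded)"
action is one of "Check-in" or "Check-out", name is free text, room number is a run of digits, and date is "YYYY-MM-DD HH MM" (assuming that really is the format). I won't write out the exact regex here, but it's fairly simple.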
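For what it's worth, one possible way to turn that pattern description into code is sketched below (Python 3). It is only one reading of the description: the names RECORD and parse_record are hypothetical, and it assumes the date really is always "YYYY-MM-DD HH MM", which lets it hand the three pieces to datetime.strptime instead of gluing strings together.

```python
import re
from datetime import datetime

# One regex per field of "<action> <name> Room <room number> at <date> (recorded)".
RECORD = re.compile(
    r"^(?P<action>Check-in|Check-out)\s+"
    r"(?P<name>.+?)\s+Room\s+(?P<room>\d+)\s+at\s+"
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<hour>\d{2})\s+(?P<minute>\d{2})\s+"
    r"\(recorded\)$"
)

def parse_record(line):
    """Return (action, name, room, when) or raise ValueError on a malformed line."""
    m = RECORD.match(line.strip())
    if m is None:
        raise ValueError('unrecognised record: %r' % line)
    # Assumed format: date, hour and minute separated by whitespace.
    when = datetime.strptime(
        '{date} {hour} {minute}'.format(**m.groupdict()), '%Y-%m-%d %H %M')
    return m.group('action'), m.group('name'), int(m.group('room')), when

action, name, room, when = parse_record(
    'Check-out Mr. Ken Beis Room 302 at 2014-11-03 05 20 (recorded)')
print(name, action, room, when.strftime('%Y-%m-%d %H:%M'))
```

Getting a real datetime back means an impossible value such as hour 25 is rejected by strptime rather than printed as-is.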
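Answer 2 (score: 1)
This works as long as all the other fields have a fixed length in words (no extra spaces inside them):
msg = ['Check-in Jones Room 403 at 2014-11-02 05 20 (recorded)',
       'Check-out Beis Room 302 at 2014-11-03 05 20 (recorded)',
       'Check-in Mr. Benny Jones Room 403 at 2014-11-02 05 20 (recorded)',
       'Check-out Mr. Ken Beis Room 302 at 2014-11-03 05 20 (recorded)']
for each_guest in msg:
    each_guest = each_guest.replace('  ', ' ')
    # split() with no argument also collapses any remaining runs of whitespace
    splits = each_guest.split()
    action = splits[0]
    # counting from the right: date and hour only; the minutes are in splits[-2]
    date_and_time = ' '.join(splits[-4:-2])
    roomnumber = splits[-6]
    # everything between the action and the word 'Room' is the name
    name = ' '.join(splits[1:-7])
    print(name, action, roomnumber, date_and_time)
This prints: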
Jones Check-in 403 2014-11-02 05
Beis Check-out 302 2014-11-03 05
Mr. Benny Jones Check-in 403 2014-11-02 05
Mr. Ken Beis Check-out 302 2014-11-03 05
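A variation on the same count-from-the-right idea, sketched here for Python 3, is to let str.rsplit peel the fixed tail off in one call. This assumes the tail is always exactly seven words ("Room", number, "at", date, hour, minute, "(recorded)"); the function name parse_from_right and the sample line with extra spaces are made up for illustration.

```python
def parse_from_right(line):
    """Split the seven fixed trailing words off the right; the rest is '<action> <name>'."""
    head, room_word, number, at_word, date, hour, minute, flag = line.rsplit(None, 7)
    # The first word of the remainder is the action, the rest is the name.
    action, raw_name = head.split(None, 1)
    # Collapse repeated spaces inside the name.
    name = ' '.join(raw_name.split())
    return name, action, number, date, hour + ':' + minute

# Hypothetical input with irregular spacing.
sample = 'Check-in  Mr. Benny  Jones   Room 403 at 2014-11-02 05 20 (recorded)'
parsed = parse_from_right(sample)
print(' '.join(parsed))
```

Unlike the slice version, this keeps the minutes and raises a ValueError on the unpacking if a line has too few words, so short or truncated lines are noticed rather than silently mis-sliced.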
Answer 3 (score: 1)
Since the strings have a fixed structure, you can use a regular expression (preferably with named groups):
import re
# the name is matched lazily, up to the whitespace in front of 'Room'
r = re.compile(r'^(?P<check>[-\w]+)\s+(?P<name>.*?)\s+Room\s+(?P<room>\d+)\s+at\s+(?P<date>\S+)\s+(?P<hour>\d+)\s+(?P<min>\d+)')
msg = ['Check-in Mr. Benny Jones Room 403 at 2014-11-02 05 20 (recorded)',
       'Check-out Mr. Ken Beis Room 302 at 2014-11-03 05 20 (recorded)',
       'Check-in Jones Room 403 at 2014-11-02 05 20 (recorded)',
       'Check-out Beis Room 302 at 2014-11-03 05 20 (recorded)']
for m in msg:
    groups = r.search(m).groupdict()
    print('{name} {check} {room} {date} {hour}:{min}'.format(**groups))
Output:
Mr. Benny Jones Check-in 403 2014-11-02 05:20
Mr. Ken Beis Check-out 302 2014-11-03 05:20
Jones Check-in 403 2014-11-02 05:20
Beis Check-out 302 2014-11-03 05:20
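A named-group pattern of this length can be easier to maintain when written with re.VERBOSE, which lets each group sit on its own line with a comment. The sketch below (Python 3) is a hedged rewrite of the same idea, not the answer's exact regex: the name RECORD is hypothetical, the name group is matched lazily, and the sample line with extra spaces is invented.

```python
import re

# In VERBOSE mode, unescaped whitespace and '#' comments inside the pattern are ignored.
RECORD = re.compile(r"""
    ^(?P<check>[-\w]+)       # action, e.g. Check-in
    \s+(?P<name>.+?)         # guest name, lazily up to the Room keyword
    \s+Room\s+(?P<room>\d+)  # room number
    \s+at\s+(?P<date>\S+)    # date token
    \s+(?P<hour>\d+)         # hour
    \s+(?P<min>\d+)          # minute
""", re.VERBOSE)

# Hypothetical input with irregular spacing.
line = 'Check-in   Mr. Benny Jones  Room 403 at 2014-11-02 05 20 (recorded)'
fields = RECORD.search(line).groupdict()
summary = '{name} {check} {room} {date} {hour}:{min}'.format(**fields)
print(summary)
```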
Answer 4 (score: 1)
I think you want index (list.index, applied to the split words); something like this extracts the full name and the other fields:
# I'm deliberately not assigning names to the different indexes, so it's clearer how it works
for each in msg:
    each = each.split()
    print(each[0],
          ' '.join(each[1:each.index('Room')]),  # full name
          each[each.index('Room')+1],            # room number
          each[each.index('at')+1],              # date
          ':'.join(each[each.index('at')+2:each.index('(recorded)')]))  # time
Output:
Check-in Mr. Benny Jones 403 2014-11-02 05:20
Check-out Mr. Ken Beis 302 2014-11-03 05:20
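If the repeated index calls get hard to read, the same lookup can be done once per keyword and the results given names. This is a rough sketch for Python 3; parse_with_index and the dictionary keys are hypothetical names, and it assumes the words 'Room', 'at' and '(recorded)' never appear inside a guest's name, since list.index returns the first occurrence.

```python
def parse_with_index(line):
    """Locate the keyword positions once, then slice the fields around them."""
    words = line.split()
    room_pos = words.index('Room')
    at_pos = words.index('at', room_pos)         # search only after 'Room'
    end_pos = words.index('(recorded)', at_pos)  # search only after 'at'
    return {
        'action': words[0],
        'name': ' '.join(words[1:room_pos]),
        'room': words[room_pos + 1],
        'date': words[at_pos + 1],
        'time': ':'.join(words[at_pos + 2:end_pos]),
    }

record = parse_with_index(
    'Check-in Mr. Benny Jones Room 403 at 2014-11-02 05 20 (recorded)')
print(record['action'], record['name'], record['room'], record['date'], record['time'])
```

A missing keyword makes list.index raise a ValueError, which is a reasonable signal that the line is not a record.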