I am currently scraping a website and extracting opening hours in the following format:
"""Hours
Monday 9:30 AM - 9:00 PM
Tuesday 9:30 AM - 9:00 PM
Wednesday 9:30 AM - 9:00 PM
Thursday 9:30 AM - 9:00 PM
Friday 9:30 AM - 11:00 PM
Saturday 9:30 AM - 11:00 PM
Sunday 11:00 AM - 6:00 PM
Holiday Hours
Thanksgiving Day 11:00 AM - 6:00 PM"""
I would like to process it so that it ends up like this:
"""Mon-Thu 9:30AM-9:00PM
Fri-Sat 9:30AM-11:00PM
Sun & Hol 11:00AM-6:00PM"""
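I'm happy to take a pseudocode solution and build it out myself for the sake of learning. So far I haven't been able to work anything out on my own.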
Answer 0 (score: 3)
OK, first we need to parse (day, opening time, closing time) out of these text blocks. Regex, anyone?
^(\w*)\s(\d{1,2}):(\d{1,2})\s(\w{2})\s-\s(\d{1,2}):(\d{1,2})\s(\w{2})
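To see what that pattern captures, here is a minimal sketch that applies it to single lines with Python's re module. The names LINE_RE and fields are only illustrative. Note that the header lines and the "Thanksgiving Day" line do not match, because `\w*` cannot span the space in a two-word name.

```python
import re

# The pattern from above, compiled once. The name LINE_RE is just for this sketch.
LINE_RE = re.compile(
    r"^(\w*)\s(\d{1,2}):(\d{1,2})\s(\w{2})\s-\s(\d{1,2}):(\d{1,2})\s(\w{2})"
)

match = LINE_RE.match("Monday 9:30 AM - 9:00 PM")
fields = match.groups()
print(fields)  # ('Monday', '9', '30', 'AM', '9', '00', 'PM')

# Section headers and two-word names are skipped by this pattern.
print(LINE_RE.match("Holiday Hours"))                        # None
print(LINE_RE.match("Thanksgiving Day 11:00 AM - 6:00 PM"))  # None
```

If holiday lines should be parsed too, the first group would need to allow spaces; that is a change to the pattern, not something it does as written.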
Now we need to group together the days that share the same opening and closing times. defaultdict?
from collections import defaultdict

d = defaultdict(list)
for line in input_block:
    # use regex to pull the components, inc day, opening time, closing time
    # concat all the opening and closing times into a single string, as you want
    d[opening_closing_time_str].append(day)
Here is my output when keying on the opening time only:
{
'09:30:00': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'],
'11:00:00': ['Sunday']
}
Now you can iterate over d and build the day ranges for each set of hours, then perhaps sort so that Monday is always on top? And you're done :)
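Putting the regex and the defaultdict steps together, a runnable sketch might look like the following. The names DATA, WEEK, group_days_by_hours and summarize are made up for illustration, and the code assumes that the days sharing one set of hours are consecutive (true for the sample, not guaranteed in general). Holiday lines are ignored because the regex does not match them.

```python
import re
from collections import defaultdict

DATA = """Hours
Monday 9:30 AM - 9:00 PM
Tuesday 9:30 AM - 9:00 PM
Wednesday 9:30 AM - 9:00 PM
Thursday 9:30 AM - 9:00 PM
Friday 9:30 AM - 11:00 PM
Saturday 9:30 AM - 11:00 PM
Sunday 11:00 AM - 6:00 PM
Holiday Hours
Thanksgiving Day 11:00 AM - 6:00 PM"""

WEEK = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

LINE_RE = re.compile(
    r"^(\w*)\s(\d{1,2}):(\d{1,2})\s(\w{2})\s-\s(\d{1,2}):(\d{1,2})\s(\w{2})"
)


def group_days_by_hours(text):
    """Map 'open-close' strings to the list of days that use them."""
    groups = defaultdict(list)
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue  # headers and holiday lines do not match
        day, oh, om, oap, ch, cm, cap = m.groups()
        groups[f"{oh}:{om}{oap}-{ch}:{cm}{cap}"].append(day)
    return dict(groups)


def summarize(groups):
    """Turn the grouped days into 'Mon-Thu 9:30AM-9:00PM' style lines."""
    # Sort so the group containing the earliest weekday comes first.
    ordered = sorted(groups.items(), key=lambda kv: WEEK.index(kv[1][0]))
    lines = []
    for hours, days in ordered:
        first, last = days[0][:3], days[-1][:3]
        label = first if first == last else f"{first}-{last}"
        lines.append(f"{label} {hours}")
    return lines


print("\n".join(summarize(group_days_by_hours(DATA))))
# Mon-Thu 9:30AM-9:00PM
# Fri-Sat 9:30AM-11:00PM
# Sun 11:00AM-6:00PM
```

If the same hours can occur on non-adjacent days (say Monday and Wednesday), the first-to-last label would be misleading, and the days would need to be split into consecutive runs first.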
Answer 1 (score: 3)
I think this is a good use case for itertools.groupby(): we can use it to group consecutive days that share the same time range. Something along these lines:
from itertools import groupby
from operator import itemgetter
from pprint import pprint
data = """Hours
Monday 9:30 AM - 9:00 PM
Tuesday 9:30 AM - 9:00 PM
Wednesday 9:30 AM - 9:00 PM
Thursday 9:30 AM - 9:00 PM
Friday 9:30 AM - 11:00 PM
Saturday 9:30 AM - 11:00 PM
Sunday 11:00 AM - 6:00 PM
Holiday Hours
Thanksgiving Day 11:00 AM - 6:00 PM"""
# filter relevant rows with weekdays only
rows = [row.split(" ", 1) for row in data.splitlines()[1:-2]]
# group consecutive days by a time range
result = []
for time_range, group in groupby(rows, key=itemgetter(1)):
    days_in_group = [item[0] for item in group]
    first_day, last_day = days_in_group[0][:3], days_in_group[-1][:3]
    range_end = "-" + str(last_day) if first_day != last_day else ""
    result.append("{begin}{end} {time_range}".format(begin=first_day,
                                                     end=range_end,
                                                     time_range=time_range))
pprint(result)
Prints:
['Mon-Thu 9:30 AM - 9:00 PM',
'Fri-Sat 9:30 AM - 11:00 PM',
'Sun 11:00 AM - 6:00 PM']
Note that this works even if every day has a different time range.
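The output above still differs from what the question asked for in two ways: the times keep their spaces, and the holiday line is dropped. One possible way to close that gap, sketched under assumptions that go beyond the original answer: everything after a "Holiday Hours" header is treated as a holiday, and " & Hol" is appended to any weekday group whose hours equal a holiday's hours. The names TIME_RE, compact and summarize are hypothetical.

```python
import re
from itertools import groupby
from operator import itemgetter

DATA = """Hours
Monday 9:30 AM - 9:00 PM
Tuesday 9:30 AM - 9:00 PM
Wednesday 9:30 AM - 9:00 PM
Thursday 9:30 AM - 9:00 PM
Friday 9:30 AM - 11:00 PM
Saturday 9:30 AM - 11:00 PM
Sunday 11:00 AM - 6:00 PM
Holiday Hours
Thanksgiving Day 11:00 AM - 6:00 PM"""

# A name (possibly several words) followed by "H:MM AM - H:MM PM".
TIME_RE = re.compile(
    r"^(.+?)\s(\d{1,2}:\d{2}\s[AP]M\s-\s\d{1,2}:\d{2}\s[AP]M)$"
)


def compact(time_range):
    """'9:30 AM - 9:00 PM' -> '9:30AM-9:00PM'."""
    return time_range.replace(" ", "")


def summarize(text):
    regular, holiday_hours = [], set()
    in_holidays = False
    for line in text.splitlines():
        line = line.strip()
        if line == "Holiday Hours":  # assumed section marker
            in_holidays = True
            continue
        m = TIME_RE.match(line)
        if m is None:
            continue  # e.g. the leading "Hours" header
        name, hours = m.group(1), compact(m.group(2))
        if in_holidays:
            holiday_hours.add(hours)
        else:
            regular.append((name, hours))

    result = []
    # groupby only merges adjacent rows, so days stay in consecutive runs.
    for hours, group in groupby(regular, key=itemgetter(1)):
        days = [name[:3] for name, _ in group]
        label = days[0] if len(days) == 1 else f"{days[0]}-{days[-1]}"
        if hours in holiday_hours:
            label += " & Hol"
        result.append(f"{label} {hours}")
    return result


print("\n".join(summarize(DATA)))
# Mon-Thu 9:30AM-9:00PM
# Fri-Sat 9:30AM-11:00PM
# Sun & Hol 11:00AM-6:00PM
```

A limitation of this sketch: holiday hours that match no weekday group are silently left out, so a separate "Hol ..." line would have to be added if that case matters.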
Answer 2 (score: 1)