I have an events table in the following format:
<table class="events">
<tbody>
<tr>
<td><span class="event_date">28.02.2018</span></td>
</tr>
<tr class="event">
<td class="event_time">18:00</td>
<td class="event_name">Event_1</td>
</tr>
<tr class="event">
<td class="event_time">19:00</td>
<td class="event_name">Event_2</td>
</tr>
<tr>
<td><span class="event_date">01.03.2018</span></td>
</tr>
<tr class="event">
<td class="event_time">18:00</td>
<td class="event_name">Event_3</td>
</tr>
<tr class="event">
<td class="event_time">19:00</td>
<td class="event_name">Event_4</td>
</tr>
<tr class="event">
<td class="event_time">20:00</td>
<td class="event_name">Event_5</td>
</tr>
</tbody>
</table>
I can easily extract the time and name of each event with the following code:
event_container = page_soup.findAll("tr", {"class":"event"})
for event in event_container:
    event_name = event.find("td", {"class":"event_name"})
    event_time = event.find("td", {"class":"event_time"})
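However, I cannot work out how to assign the correct event_date to each of those events.
Desired output:
Name: Event_1, Date: 28.02.2018, Time: 18:00
Name: Event_2, Date: 28.02.2018, Time: 19:00
Name: Event_3, Date: 01.03.2018, Time: 18:00
Name: Event_4, Date: 01.03.2018, Time: 19:00
Name: Event_5, Date: 01.03.2018, Time: 20:00
Thanks for your help.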
Answer 0 (score: 0)
One possibility is to scrape all of the desired text first and then group it under the appropriate event date:
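from bs4 import BeautifulSoup as soup
import itertools, re

def beautify(f):
    # Format each [date, time, name] triple returned by f() as a readable line.
    def wrapper():
        return ["Name: {name}, Date: {date}, Time: {time}".format(**dict(zip(['date', 'time', 'name'], i))) for i in f()]
    return wrapper

@beautify
def raw_data():
    # `data` is assumed to hold the HTML of the events table as a string.
    s = soup(data, 'lxml')
    # Flatten every <td> into its text: date, time, name, time, name, ...
    final_data = [i.text for i in s.find_all('td')]
    # Split the flat list into alternating runs: [date], [time, name, ...], [date], ...
    final_results = [list(b) for a, b in itertools.groupby(final_data, key=lambda x:bool(re.findall(r'\d+\.\d+\.\d+$', x)))]
    # Prefix each (time, name) pair with the date of the run that precedes it.
    new_final_data = [[a+b[i:i+2] for i in range(0, len(b), 2)] for a, b in [final_results[i:i+2] for i in range(0, len(final_results), 2)]]
    return [i for b in new_final_data for i in b]

print(raw_data())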
Output:
['Name: Event_1, Date: 28.02.2018, Time: 18:00', 'Name: Event_2, Date: 28.02.2018, Time: 19:00', 'Name: Event_3, Date: 01.03.2018, Time: 18:00', 'Name: Event_4, Date: 01.03.2018, Time: 19:00', 'Name: Event_5, Date: 01.03.2018, Time: 20:00']
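Answer 1 (score: 0)
One thing you can do is iterate over all of the <tr> tags and check whether each one contains a <span> tag holding a date. If it does, update the current date; otherwise, read the event name and time from that row.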
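The explicit check described above can be sketched as follows. This is only an illustration under some assumptions: the sample markup is a shortened, cleaned-up copy of the table from the question, the `html.parser` backend is used so no extra parser is needed, and the names `sample_html` and `collect_events` are made up for this example.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's table (assumption: well-formed <td> tags).
sample_html = """
<table class="events">
<tbody>
<tr><td><span class="event_date">28.02.2018</span></td></tr>
<tr class="event"><td class="event_time">18:00</td><td class="event_name">Event_1</td></tr>
<tr class="event"><td class="event_time">19:00</td><td class="event_name">Event_2</td></tr>
<tr><td><span class="event_date">01.03.2018</span></td></tr>
<tr class="event"><td class="event_time">18:00</td><td class="event_name">Event_3</td></tr>
</tbody>
</table>
"""

def collect_events(markup):
    """Return (name, date, time) tuples, carrying the last seen date forward."""
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("table", class_="events")
    current_date = ""
    events = []
    for row in table.find_all("tr"):
        date_span = row.find("span", class_="event_date")
        if date_span is not None:
            # Date row: remember the date and move on to the next row.
            current_date = date_span.get_text(strip=True)
            continue
        # Event row: pair its name and time with the most recent date.
        event_name = row.find("td", class_="event_name").get_text(strip=True)
        event_time = row.find("td", class_="event_time").get_text(strip=True)
        events.append((event_name, current_date, event_time))
    return events

for event_name, event_date, event_time in collect_events(sample_html):
    print("Name: {}, Date: {}, Time: {}".format(event_name, event_date, event_time))
```

Returning tuples rather than printing inside the loop keeps the parsing step separate from the formatting step, which makes the result easier to reuse.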
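However, Python favors the EAFP style (Easier to Ask for Forgiveness than Permission), so you can simply try to read the date and fall back to the event fields when that fails: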
# `soup` is the parsed BeautifulSoup object for the page.
table = soup.find('table', class_='events')
event_date = ''
for row in table.find_all('tr'):
    try:
        # Date rows have a <span> inside their first <td>.
        event_date = row.td.span.text
        continue
    except AttributeError:
        # Event rows have no <span>, so row.td.span is None.
        pass
    event_name = row.find('td', class_='event_name').text
    event_time = row.find('td', class_='event_time').text
    print('Name: {}, Date: {}, Time: {}'.format(event_name, event_date, event_time))
Output:
Name: Event_1, Date: 28.02.2018, Time: 18:00
Name: Event_2, Date: 28.02.2018, Time: 19:00
Name: Event_3, Date: 01.03.2018, Time: 18:00
Name: Event_4, Date: 01.03.2018, Time: 19:00
Name: Event_5, Date: 01.03.2018, Time: 20:00
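Another option, not taken from either answer, is to keep the question's loop over the `tr.event` rows and look backwards from each row for the nearest preceding date. The sketch below uses Beautiful Soup's `find_previous` for that; the markup is a shortened, cleaned-up copy of the question's table, and `events_html` and `events_with_dates` are hypothetical names chosen for the example.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's table (assumption: well-formed <td> tags).
events_html = """
<table class="events">
<tbody>
<tr><td><span class="event_date">28.02.2018</span></td></tr>
<tr class="event"><td class="event_time">18:00</td><td class="event_name">Event_1</td></tr>
<tr class="event"><td class="event_time">19:00</td><td class="event_name">Event_2</td></tr>
<tr><td><span class="event_date">01.03.2018</span></td></tr>
<tr class="event"><td class="event_time">20:00</td><td class="event_name">Event_5</td></tr>
</tbody>
</table>
"""

def events_with_dates(markup):
    """Return one formatted line per event row, dated by the closest earlier date span."""
    page_soup = BeautifulSoup(markup, "html.parser")
    lines = []
    for event in page_soup.find_all("tr", class_="event"):
        # Walk backwards through the document to the closest date span.
        date_span = event.find_previous("span", class_="event_date")
        event_date = date_span.get_text(strip=True) if date_span else ""
        event_name = event.find("td", class_="event_name").get_text(strip=True)
        event_time = event.find("td", class_="event_time").get_text(strip=True)
        lines.append("Name: {}, Date: {}, Time: {}".format(event_name, event_date, event_time))
    return lines

for line in events_with_dates(events_html):
    print(line)
```

This version searches backwards once per event, so it does a little more work than carrying the date forward in a single pass, but it needs no state between iterations.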