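I am trying to scrape the dates, times and venues of events. They come out successfully, but the output is not reader-friendly. How can I get the date, time and venue displayed separately, like this: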
- event
Date:
Time:
Venue:
- event
Date:
Time:
Venue:
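I thought about splitting the text, but ended up with a lot of [] that looked even uglier. I also thought about stripping, but my regex does not seem to have any effect. Any suggestions?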
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url_toscrape = "https://www.ntu.edu.sg/events/Pages/default.aspx"
response = urlopen(url_toscrape)
info_type = response.info()
responseData = response.read()
soup = BeautifulSoup(responseData, 'lxml')

# Title of the first (highlighted) event
events_absFirst = soup.find_all("div", {"class": "ntu_event_summary_title_first"})
for first in events_absFirst:
    print('-', first.text.strip())

# Remaining events: one "ntu_event_detail" div per event
for tr in soup.find_all("div", {"class": "ntu_event_detail"}):
    date_absAll = tr.find_all("div", {"class": "ntu_event_summary_date"})
    events_absAll = tr.find_all("div", {"class": "ntu_event_summary_title"})
    for events in events_absAll:
        events = events.text.strip()
    for date in date_absAll:
        # the regex-style strip that does not seem to have any effect
        date = date.text.strip('^Time.*')
    print('-', events)
    print(' ', date)
Answer 0 (score: 0)
You can use requests and test the length of stripped_strings for each block:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url_toscrape = "https://www.ntu.edu.sg/events/Pages/default.aspx"
response = requests.get(url_toscrape)
soup = BeautifulSoup(response.content, 'lxml')
events = [item.text for item in soup.select("[class^='ntu_event_summary_title']")]
data = soup.select('.ntu_event_summary_date')
dates = []
times = []
venues = []
for item in data:
    # stripped_strings yields each text fragment with surrounding whitespace removed
    strings = [string for string in item.stripped_strings]
    if len(strings) == 3:
        dates.append(strings[0])
        times.append(strings[1])
        venues.append(strings[2])
    elif len(strings) == 2:
        dates.append(strings[0])
        times.append(strings[1])
        venues.append('N/A')
    elif len(strings) == 1:
        dates.append(strings[0])
        times.append('N/A')
        venues.append('N/A')
results = list(zip(events, dates, times, venues))
df = pd.DataFrame(results)
print(df)
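The length test assumes that a missing field is always the last one, so a block holding only a date and a venue would put the venue in the time column. A minimal sketch of an alternative that keys each string on its own label is shown here. It assumes every string looks like "Date : ...", "Time : ..." or "Venue: ..."; the helper name split_details and the sample list are hypothetical, and the list stands in for what item.stripped_strings might yield.

```python
def split_details(strings):
    """Return Date/Time/Venue values taken from 'Label : value' strings."""
    details = {"Date": "N/A", "Time": "N/A", "Venue": "N/A"}
    for text in strings:
        # Split at the first colon only, so times such as 9:00am stay intact.
        label, sep, value = text.partition(":")
        label = label.strip()
        if sep and label in details:
            details[label] = value.strip()
    return details


# Sample strings in the assumed format (not scraped from the live page).
sample = ["Date : 14 Jan 2019 to 11 Apr 2019", "Venue: NIE Art gallery"]
print(split_details(sample))
```

Each returned dict could be appended to a list and passed to pd.DataFrame, which would also give the columns readable names.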
Answer 1 (score: 0)
You can iterate over the divs that contain the event information, store the results, and then print each one:
import requests, re
from bs4 import BeautifulSoup as soup
d = soup(requests.get('https://www.ntu.edu.sg/events/Pages/default.aspx').text, 'html.parser')
results = [[getattr(i.find('div', {'class':re.compile('ntu_event_summary_title_first|ntu_event_summary_title')}), 'text', 'N/A'), getattr(i.find('div', {'class':'ntu_event_summary_detail'}), 'text', 'N/A')] for i in d.find_all('div', {'class':'ntu_event_articles'})]
new_results = [[a, re.findall(r'Date : .*?(?=\sTime)|Time : .*?(?=Venue)|Time : .*?(?=$)|Venue: [\w\W]+', b)] for a, b in results]
print('\n\n'.join('-{}\n{}'.format(a, '\n'.join(f' {h}:{i}' for h, i in zip(['Date', 'Time', 'Venue'], b))) for a, b in new_results))
Output:
-7th ASEF Rectors' Conference and Students' Forum (ARC7)
Date:Date : 29 Nov 2018 to 14 May 2019
Time:Time : 9:00am to 5:00pm
-Be a Youth Corps Leader
Date:Date : 1 Dec 2018 to 31 Mar 2019
Time:Time : 9:00am to 5:00pm
-NIE Visiting Artist Programme January 2019
Date:Date : 14 Jan 2019 to 11 Apr 2019
Time:Time : 9:00am to 8:00pm
Venue:Venue: NIE Art gallery
-Exercise Classes for You: Healthy Campus@NTU
Date:Date : 21 Jan 2019 to 18 Apr 2019
Time:Time : 6:00pm to 7:00pm
Venue:Venue: The Wave @ Sports & Recreation Centre
-[eLearning Course] Information & Media Literacy (From January 2019)
Date:Date : 23 Jan 2019 to 31 May 2019
Time:Time : 9:00am to 5:00pm
Venue:Venue: NTULearn
...
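The labels appear twice in this output ("Date:Date : ...") because each match already contains its own label. The sketch below is one possible way to capture the label and the value separately. It assumes the flattened detail text has the shape "Date : ... Time : ... Venue: ..." with optional whitespace between the parts; FIELD_RE and parse_detail are hypothetical names. The last line illustrates why the original `strip('^Time.*')` did not behave like a regex: `str.strip` removes any of the listed characters from both ends and does not interpret a pattern.

```python
import re

# Label, then value up to the next label or the end of the text.
FIELD_RE = re.compile(
    r"(Date|Time|Venue)\s*:\s*(.*?)\s*(?=Time\s*:|Venue\s*:|$)",
    re.DOTALL,
)


def parse_detail(text):
    """Return a dict of Date/Time/Venue values, 'N/A' where a field is absent."""
    fields = {"Date": "N/A", "Time": "N/A", "Venue": "N/A"}
    fields.update(FIELD_RE.findall(text))  # findall yields (label, value) pairs
    return fields


# Sample text in the assumed shape, built from the output shown above.
text = "Date : 14 Jan 2019 to 11 Apr 2019 Time : 9:00am to 8:00pm Venue: NIE Art gallery"
for label, value in parse_detail(text).items():
    print(f"  {label}: {value}")

# str.strip treats its argument as a set of characters, not a pattern.
print(repr("Time : 9:00am".strip("^Time.*")))  # → ' : 9:00a'
```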