I'm trying to get my program to collect and print the titles of events from a website. The problem with my code is that it prints more than just the titles: it also prints the hyperlinks. How do I get rid of the hyperlinks?
from urllib.request import urlopen
from bs4 import BeautifulSoup

url_toscrape = "https://www.ntu.edu.sg/events/Pages/default.aspx"
response = urlopen(url_toscrape)
info_type = response.info()
responseData = response.read()

soup = BeautifulSoup(responseData, 'lxml')
events_absAll = soup.find_all("div", {"class": "ntu_event_summary_title_first"})

for events in events_absAll:
    if len(events.text) > 0:
        print(events.text.strip())
print(events_absAll)
Also, how do I get the for loop to keep going so that I get the full list of events, like the one below?
- 7th ASEF Rectors' Conference and Students' Forum (ARC7)
- Be a Youth Corps Leader
- NIE Visiting Artist Programme January 2019
- Exercise Classes for You: Healthy Campus@NTU
- [eLearning Course] Information & Media Literacy (From January 2019)
Thanks in advance.
Answer 0: (score: 0)
Continuing from the comments:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url_toscrape = "https://www.ntu.edu.sg/events/Pages/default.aspx"
response = urlopen(url_toscrape)
info_type = response.info()
responseData = response.read()

soup = BeautifulSoup(responseData, 'lxml')
events_absFirst = soup.find_all("div", {"class": "ntu_event_summary_title_first"})
events_absAll = soup.find_all("div", {"class": "ntu_event_summary_title"})

for first in events_absFirst:
    print(first.text.strip())
for events in events_absAll:
    print(events.text.strip())
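The page apparently tags the first event's title with the class ntu_event_summary_title_first and the remaining titles with ntu_event_summary_title, which is why two separate find_all calls are needed here.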
OR (even better): use the class ntu_event_detail and find the a tags inside it:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.ntu.edu.sg/events/Pages/default.aspx")
soup = BeautifulSoup(page.content, 'html.parser')
events_absAll = soup.find_all("div", {"class": "ntu_event_detail"})

for events in events_absAll:
    for a in events.find_all('a'):
        print(a.text.strip())
Output:
7th ASEF Rectors' Conference and Students' Forum (ARC7)
Be a Youth Corps Leader
NIE Visiting Artist Programme January 2019
Exercise Classes for You: Healthy Campus@NTU
[eLearning Course] Information & Media Literacy (From January 2019)
[Workshop] Introduction to Zotero (Jan to Apr 2019)
[Workshop] Introduction to Mendeley (Jan to Apr 2019)
Sembcorp
Marine Green Wave Environmental Care Competition 2019 - Submit by 31 March 2019
[Consultation] Consultation for EndNote-Mac Users (Jan to Apr 2019)
The World Asian Business Case Competition, WACC 2019 at Seoul (proposal submission by 01 April 2019)
Heartware Network
.
.
.
EDIT: An even better approach is to create a list, store the results in it, and filter out empty strings (if any):
data = []
for events in events_absAll:
    for a in events.find_all('a'):
        data.append(a.text)

filtered = list(filter(None, data))  # fastest
for elem in filtered:
    print(elem)
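(filter(None, data) keeps only the truthy items, which for a list of strings means dropping the empty ones, so no explicit predicate is needed.)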
Answer 1: (score: 0)
You can use the ^ (starts with) operator together with an attribute = value CSS selector to target the start of each title's class attribute.
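A minimal sketch of that approach (an illustration under assumptions, not code from the answer itself) could look like this; it relies on bs4's CSS selector support and assumes every title div's class attribute begins with ntu_event_summary_title:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.ntu.edu.sg/events/Pages/default.aspx")
soup = BeautifulSoup(page.content, 'html.parser')

# [class^="..."] selects divs whose class attribute starts with the given prefix,
# so it matches both ntu_event_summary_title and ntu_event_summary_title_first
for title in soup.select('div[class^="ntu_event_summary_title"]'):
    print(title.text.strip())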
Answer 2: (score: 0)
Thank you very much for your help. I now have another problem. I'm trying to collect the events' dates, times and venues. They come out fine, but the output is not reader-friendly. How can I get the date, time and venue to be displayed separately, like this:
- event
Date:
Time:
Venue:
I was going to split the text, but I ended up with lots of [], which made it look even uglier. I thought about stripping, but my regex doesn't seem to do anything. Any suggestions?
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url_toscrape = "https://www.ntu.edu.sg/events/Pages/default.aspx"
response = urlopen(url_toscrape)
info_type = response.info()
responseData = response.read()

soup = BeautifulSoup(responseData, 'lxml')

events_absFirst = soup.find_all("div", {"class": "ntu_event_summary_title_first"})
for first in events_absFirst:
    print('-', first.text.strip())

for tr in soup.find_all("div", {"class": "ntu_event_detail"}):
    date_absAll = tr.find_all("div", {"class": "ntu_event_summary_date"})
    events_absAll = tr.find_all("div", {"class": "ntu_event_summary_title"})
    for events in events_absAll:
        events = events.text.strip()
    for date in date_absAll:
        date = date.text.strip('^Time.*')
    print('-', events)
    print(' ', date)
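A note on why the strip call does nothing: str.strip() treats its argument as a set of individual characters to remove from both ends of the string, not as a regular expression, so '^Time.*' has no effect unless the text happens to start or end with one of those characters. Regex work needs the re module. Below is a minimal, hypothetical sketch of that idea; the sample string and the "Time"/"Venue" labels are assumptions for illustration, not taken from the actual page:

import re

# hypothetical text of a ntu_event_summary_date div, with date, time and venue run together
raw = "25 Jan 2019Time : 9.30am - 5.00pmVenue : LT 19"

# split on the assumed "Time :" / "Venue :" labels instead of using strip()
parts = [p.strip() for p in re.split(r'Time\s*:|Venue\s*:', raw)]
print('Date: ', parts[0])
print('Time: ', parts[1])
print('Venue:', parts[2])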