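I'm supposed to use Beautiful Soup 4 to scrape course information from my school's website as an exercise. I've been working on this for the past few days, but my code still doesn't work.
The first thing I ask the user for is the course catalog abbreviation. For example, ICS is the abbreviation for Information and Computer Sciences. Beautiful Soup 4 should then list all the courses and how many students are enrolled.
I can get the input part working, but I still get errors or the program just stops.
Question: Can Beautiful Soup take user input, so that when the user enters ICS, the output is a list of all the courses related to ICS?
Here is the code and my attempt: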
from bs4 import BeautifulSoup
import requests
import re
#get input for course
course = input('Enter the course:')
#Here is the page link
BASE_AVAILABILITY_URL = f"https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s={course}"
#get request and response
page_response = requests.get(BASE_AVAILABILITY_URL)
#getting Beautiful Soup to gather the html content
page_content = BeautifulSoup(page_response.content, 'html.parser')
#getting course information
main = page_content.find_all(class_='parent clearfix')
main_p = "".join(str (x) for x in main)
#get the course anchor tags
main_q = BeautifulSoup(main_p, "html.parser")
courses = main.find('a', href = True)
#get each course name
#empty dictionary for course list
courses_list = []
for a in courses:
    courses_list.append(a.text)
search = input('Enter the course title:')
for course in courses_list:
    if re.search(search, course, re.IGNORECASE):
        print(course)
This is the original code provided in the Jupyter Notebook:
import requests, bs4
BASE_AVAILABILITY_URL = f"https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s={course}"
#get input for course
course = input('Enter the course:')
def scrape_availability(text):
    soup = bs4.BeautifulSoup(text)
    r = requests.get(str(BASE_AVAILABILITY_URL) + str(course))
    rows = soup.select('.listOfClasses tr')
    for row in rows[1:]:
        columns = row.select('td')
        class_name = columns[2].contents[0]
        if len(class_name) > 1 and class_name != b'\xa0':
            print(class_name)
            print(columns[4].contents[0])
            print(columns[7].contents[0])
            print(columns[8].contents[0])
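Oddly, if the user saves the HTML file, uploads it to Jupyter Notebook, and then opens the file for reading, the courses are displayed. For this task, however, the user cannot save the file; the output has to come directly from the input.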
Answer 0 (score: 1)
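The problem with your code is that page_content.find_all(class_='parent clearfix') returns an empty list []. That is the first thing you need to change. Looking at the HTML, you will want to look for the <table>, <tr>, and <td> tags.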
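To make the empty-result point concrete, here is a minimal sketch that runs offline. SAMPLE_HTML is a made-up, trimmed-down stand-in for the real page (only the listOfClasses class name comes from the code in the question), so treat the structure as an assumption rather than a copy of the site.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the availability page; the real markup is larger.
SAMPLE_HTML = """
<table class="listOfClasses">
  <tr><th>CRN</th><th>Course</th></tr>
  <tr><td>12345</td><td>ICS 101</td></tr>
  <tr><td>12346</td><td>ICS 111</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Nothing in this markup carries the class "parent clearfix",
# so find_all returns an empty result set instead of raising an error.
missing = soup.find_all(class_="parent clearfix")
print(len(missing))  # 0

# Selecting rows of the table finds the header row plus two data rows.
rows = soup.select(".listOfClasses tr")
print(len(rows))  # 3

# Skip the header row and read the second cell of each data row.
course_names = [row.select("td")[1].get_text(strip=True) for row in rows[1:]]
print(course_names)  # ['ICS 101', 'ICS 111']
```

Printing the length of a result right after find_all or select is a quick way to confirm that a selector matches anything before building more logic on top of it.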
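Working from the original code you were given, you only need to change a few things to make the logic flow.
I'll point out the changes I made: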
import requests, bs4
BASE_AVAILABILITY_URL = f"https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s={course}"
#get input for course
course = input('Enter the course:')
def scrape_availability(text):
    soup = bs4.BeautifulSoup(text) #<-- need to get the html text before creating a bs4 object. So I move the request (line below) before this, and also adjusted the parameter for this function.
    # the rest of the code is fine
    r = requests.get(str(BASE_AVAILABILITY_URL) + str(course))
    rows = soup.select('.listOfClasses tr')
    for row in rows[1:]:
        columns = row.select('td')
        class_name = columns[2].contents[0]
        if len(class_name) > 1 and class_name != b'\xa0':
            print(class_name)
            print(columns[4].contents[0])
            print(columns[7].contents[0])
            print(columns[8].contents[0])
That gives you:
import requests, bs4
BASE_AVAILABILITY_URL = "https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s="
#get input for course
course = input('Enter the course:')
url = BASE_AVAILABILITY_URL + course
def scrape_availability(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    rows = soup.select('.listOfClasses tr')
    for row in rows[1:]:
        columns = row.select('td')
        class_name = columns[2].contents[0]
        if len(class_name) > 1 and class_name != b'\xa0':
            print(class_name)
            print(columns[4].contents[0])
            print(columns[7].contents[0])
            print(columns[8].contents[0])

scrape_availability(url)
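If you want to go a step further, one possible variation (a sketch, not the only way to structure it) is to split the parsing from the network request. The parsing function can then be checked against any HTML string, which is in effect why the saved-file approach worked. The names parse_availability, fetch_availability, _row, and SAMPLE_HTML are hypothetical helpers, the sample row values are placeholders, and the column positions 2, 4, 7, and 8 are simply taken from the code above. The query parameters are passed through the params argument of requests.get so the URL no longer depends on course being defined first. Note one small behavioral difference: this version skips rows whose third cell is empty after stripping whitespace, rather than testing len(class_name) > 1.

```python
import requests
from bs4 import BeautifulSoup

BASE_AVAILABILITY_URL = "https://www.sis.hawaii.edu/uhdad/avail.classes"


def parse_availability(html):
    """Return the text of columns 2, 4, 7 and 8 for each usable table row."""
    soup = BeautifulSoup(html, "html.parser")
    parsed = []
    for row in soup.select(".listOfClasses tr")[1:]:  # skip the header row
        # get_text(strip=True) also strips non-breaking spaces (\xa0).
        cells = [td.get_text(strip=True) for td in row.select("td")]
        # Ignore spacer rows: too few cells, or an empty class-name cell.
        if len(cells) > 8 and cells[2]:
            parsed.append([cells[2], cells[4], cells[7], cells[8]])
    return parsed


def fetch_availability(subject):
    """Download the page for one subject code (e.g. 'ICS') and parse it."""
    params = {"i": "MAN", "t": "202010", "s": subject}
    response = requests.get(BASE_AVAILABILITY_URL, params=params, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return parse_availability(response.text)


def _row(cells, tag="td"):
    """Build one HTML table row from a list of cell strings (test helper)."""
    return "<tr>" + "".join(f"<{tag}>{c}</{tag}>" for c in cells) + "</tr>"


# Made-up sample with nine columns: a header, one data row, one spacer row.
SAMPLE_HTML = (
    '<table class="listOfClasses">'
    + _row([f"col{i}" for i in range(9)], tag="th")
    + _row(["a", "b", "ICS 101", "d", "Title A", "f", "g", "20", "5"])
    + _row(["&nbsp;"] * 9)
    + "</table>"
)

print(parse_availability(SAMPLE_HTML))  # [['ICS 101', 'Title A', '20', '5']]

# With network access, the live page would be read like this:
# for entry in fetch_availability(input("Enter the course:").strip().upper()):
#     print(*entry, sep=" | ")
```

Keeping the parser free of network calls means a selector problem and a download problem show up separately, which makes errors like the ones in the question easier to trace.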