我正在尝试构建一个应用程序,该应用程序从大学课程目录中抓取课程信息,然后构建学生可以选择的一些时间表。每次搜索新课程时,课程目录URL都不会更改,这就是为什么我要使用Selenium自动搜索课程目录,然后使用Beautiful Soup收集信息的原因。如果解决方案非常简单,这是我第一次使用Beautiful Soup和Selenium预先道歉。
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import requests
URL = "http://saasta.byu.edu/noauth/classSchedule/index.php"
driver = webdriver.Safari()
driver.get(URL)
element = driver.find_element_by_id("searchBar")
element.send_keys("C S 142", Keys.RETURN)
response = requests.get(URL);
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find_all("tbody")
print(table)
当前,当我print(table)
时,它打印两个对象。一个,(第一张图片)具有关于该课程的一般信息(我不需要抓取的信息)。第二个对象为空。据我所知,网站上只有两个表格,两个都如下图所示。第二个是我感兴趣的一个,但是由于某种原因,table
中的第二个元素为空。
以下信息是我要抓取的信息。
print(table)
的输出
<tbody>
\n
<tr>
<th scope="row">Hours</th>
<td id="courseCredits"></td>
</tr>
\n
<tr>
<th scope="row">Prerequisites</th>
<td id="coursePrereqs"></td>
</tr>
\n
<tr>
<th scope="row">Recommended</th>
<td id="courseRec"></td>
</tr>
\n
<tr>
<th scope="row">Offered</th>
<td id="courseOffered"></td>
</tr>
\n
<tr>
<th scope="row">Headers</th>
<td id="courseHeaders"></td>
</tr>
\n
<tr>
<th scope="row">Note</th>
<td id="courseNote"></td>
</tr>
\n
<tr>
<th scope="row">When\xa0Taught</th>
<td id="courseWhenTaught"></td>
</tr>
\n
</tbody>
,
<tbody></tbody>
]
答案 0 :(得分:1)
这是一种解析表的技术:
AudioManager m_amAudioManager = (AudioManager)getSystemService(Context.AUDIO_SERVICE);
m_amAudioManager.setMode(AudioManager.MODE_RINGTONE | AudioManager.MODE_IN_CALL);
m_amAudioManager.setSpeakerphoneOn(true);
答案 1 :(得分:0)
仅使用硒,这非常简单:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
URL = "http://saasta.byu.edu/noauth/classSchedule/index.php"
driver = webdriver.Safari()
driver.get(URL)
element = driver.find_element_by_id("searchBar")
element.send_keys("C S 142", Keys.RETURN)
# get table
table = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//table[@id='sectionTable']")))
# iterate rows and cells
rows = table.find_elements_by_xpath("//tr")
for row in rows:
# get cells
cells = row.find_elements_by_tag_name("td")
# iterate cells
for cell in cells:
print(cell.text)
希望这可以帮助您入门。
答案 2 :(得分:0)
如果您只想使用requests
模块,而没有硒的解决方案,我就把它留在这里:
import json
import requests
url_classes = 'https://saasta.byu.edu/noauth/classSchedule/ajax/getClasses.php'
url_sections = 'https://saasta.byu.edu/noauth/classSchedule/ajax/getSections.php'
data_classes = {
'searchObject[yearterm]':'20195',
'searchObject[dept_name]':'C S',
'searchObject[catalog_number]':'142',
'sessionId':''
}
data_sections = {
'courseId':'',
'sessionId':'',
'yearterm':'20195',
'no_outcomes':'true'
}
classes = requests.post(url_classes, data=data_classes).json()
data_sections['courseId'] = next(iter(classes))
sections = requests.post(url_sections, data=data_sections).json()
# print(json.dumps(sections, indent=4)) # <-- uncomment this to see all data
# print(json.dumps(classes, indent=4))
for section in sections['sections']:
print(section)
print('-' * 80)
这将打印所有部分(但如果取消注释打印语句,则会有更多数据):
{'curriculum_id': '01489', 'title_code': '002', 'dept_name': 'C S', 'catalog_number': '142', 'catalog_suffix': None, 'section_number': '001', 'fixed_or_variable': 'F', 'credit_hours': '3.00', 'minimum_credit_hours': '3.00', 'honors': None, 'section_type': 'DAY', 'credit_type': 'S', 'start_date': '2019-09-03', 'end_date': '2019-12-12', 'year_term': '20195', 'instructors': [{'person_id': '241223832', 'byu_id': '821566504', 'net_id': 'bretted', 'surname': 'Decker', 'sort_name': 'Decker, Brett E', 'rest_of_name': 'Brett E', 'preferred_first_name': 'Brett', 'phone_number': '801-380-4463', 'attribute_type': 'PRIMARY', 'year_term': '20195', 'curriculum_id': '01489', 'title_code': '002', 'section_number': '001', 'dept_name': 'C S', 'catalog_number': '142', 'catalog_suffix': None, 'fixed_or_variable': 'F', 'credit_hours': '3.00', 'minimum_credit_hours': '3.00', 'honors': None, 'credit_type': 'S', 'section_type': 'DAY'}], 'times': [{'begin_time': '0900', 'end_time': '0950', 'building': 'TMCB', 'room': '1170', 'sequence_number': '2', 'mon': 'M', 'tue': '', 'wed': 'W', 'thu': '', 'fri': 'F', 'sat': '', 'sun': ''}], 'headers': [], 'availability': {'seats_available': '51', 'class_size': '203', 'waitlist_size': '0'}}
--------------------------------------------------------------------------------
{'curriculum_id': '01489', 'title_code': '002', 'dept_name': 'C S', 'catalog_number': '142', 'catalog_suffix': None, 'section_number': '002', 'fixed_or_variable': 'F', 'credit_hours': '3.00', 'minimum_credit_hours': '3.00', 'honors': None, 'section_type': 'DAY', 'credit_type': 'S', 'start_date': '2019-09-03', 'end_date': '2019-12-12', 'year_term': '20195', 'instructors': [{'person_id': '241223832', 'byu_id': '821566504', 'net_id': 'bretted', 'surname': 'Decker', 'sort_name': 'Decker, Brett E', 'rest_of_name': 'Brett E', 'preferred_first_name': 'Brett', 'phone_number': '801-380-4463', 'attribute_type': 'PRIMARY', 'year_term': '20195', 'curriculum_id': '01489', 'title_code': '002', 'section_number': '002', 'dept_name': 'C S', 'catalog_number': '142', 'catalog_suffix': None, 'fixed_or_variable': 'F', 'credit_hours': '3.00', 'minimum_credit_hours': '3.00', 'honors': None, 'credit_type': 'S', 'section_type': 'DAY'}], 'times': [{'begin_time': '1000', 'end_time': '1050', 'building': 'TMCB', 'room': '1170', 'sequence_number': '2', 'mon': 'M', 'tue': '', 'wed': 'W', 'thu': '', 'fri': 'F', 'sat': '', 'sun': ''}], 'headers': [], 'availability': {'seats_available': '34', 'class_size': '203', 'waitlist_size': '0'}}
--------------------------------------------------------------------------------
...and so on.