使用BeautifulSoup和Selenium从表中收集数据

时间:2019-12-30 23:37:13

标签: python selenium web-scraping beautifulsoup

我正在尝试构建一个应用程序,该应用程序从大学课程目录中抓取课程信息,然后构建学生可以选择的一些时间表。每次搜索新课程时,课程目录URL都不会更改,这就是为什么我要使用Selenium自动搜索课程目录,然后使用Beautiful Soup收集信息的原因。如果解决方案非常简单,这是我第一次使用Beautiful Soup和Selenium预先道歉。

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import requests

URL = "http://saasta.byu.edu/noauth/classSchedule/index.php"
driver = webdriver.Safari()
driver.get(URL)
element = driver.find_element_by_id("searchBar")
element.send_keys("C S 142", Keys.RETURN)
response = requests.get(URL);
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find_all("tbody")
print(table)

当前,当我print(table)时,它打印两个对象。一个,(第一张图片)具有关于该课程的一般信息(我不需要抓取的信息)。第二个对象为空。据我所知,网站上只有两个表格,两个都如下图所示。第二个是我感兴趣的一个,但是由于某种原因,table中的第二个元素为空。

enter image description here

以下信息是我要抓取的信息。

This is the information I am trying to scrape

print(table)的输出

<tbody>
   \n
   <tr>
      <th scope="row">Hours</th>
      <td id="courseCredits"></td>
   </tr>
   \n
   <tr>
      <th scope="row">Prerequisites</th>
      <td id="coursePrereqs"></td>
   </tr>
   \n
   <tr>
      <th scope="row">Recommended</th>
      <td id="courseRec"></td>
   </tr>
   \n
   <tr>
      <th scope="row">Offered</th>
      <td id="courseOffered"></td>
   </tr>
   \n
   <tr>
      <th scope="row">Headers</th>
      <td id="courseHeaders"></td>
   </tr>
   \n
   <tr>
      <th scope="row">Note</th>
      <td id="courseNote"></td>
   </tr>
   \n
   <tr>
      <th scope="row">When\xa0Taught</th>
      <td id="courseWhenTaught"></td>
   </tr>
   \n
</tbody>
, 
<tbody></tbody>
]

3 个答案:

答案 0 :(得分:1)

这是一种解析表的技术:

AudioManager m_amAudioManager = (AudioManager)getSystemService(Context.AUDIO_SERVICE);

    m_amAudioManager.setMode(AudioManager.MODE_RINGTONE | AudioManager.MODE_IN_CALL);
    m_amAudioManager.setSpeakerphoneOn(true);

Run on Repl.it

答案 1 :(得分:0)

仅使用硒,这非常简单:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys

URL = "http://saasta.byu.edu/noauth/classSchedule/index.php"
driver = webdriver.Safari()
driver.get(URL)
element = driver.find_element_by_id("searchBar")
element.send_keys("C S 142", Keys.RETURN)

# get table
table = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//table[@id='sectionTable']")))

# iterate rows and cells
rows = table.find_elements_by_xpath("//tr")
for row in rows:

    # get cells
    cells = row.find_elements_by_tag_name("td")

    # iterate cells
    for cell in cells:
        print(cell.text)

希望这可以帮助您入门。

答案 2 :(得分:0)

如果您只想使用requests模块,而没有硒的解决方案,我就把它留在这里:

import json
import requests

url_classes = 'https://saasta.byu.edu/noauth/classSchedule/ajax/getClasses.php'
url_sections = 'https://saasta.byu.edu/noauth/classSchedule/ajax/getSections.php'

data_classes = {
    'searchObject[yearterm]':'20195',
    'searchObject[dept_name]':'C S',
    'searchObject[catalog_number]':'142',
    'sessionId':''
}

data_sections = {
    'courseId':'',
    'sessionId':'',
    'yearterm':'20195',
    'no_outcomes':'true'
}

classes = requests.post(url_classes, data=data_classes).json()
data_sections['courseId'] = next(iter(classes))
sections = requests.post(url_sections, data=data_sections).json()

# print(json.dumps(sections, indent=4)) # <-- uncomment this to see all data
# print(json.dumps(classes, indent=4))

for section in sections['sections']:
    print(section)
    print('-' * 80)

这将打印所有部分(但如果取消注释打印语句,则会有更多数据):

{'curriculum_id': '01489', 'title_code': '002', 'dept_name': 'C S', 'catalog_number': '142', 'catalog_suffix': None, 'section_number': '001', 'fixed_or_variable': 'F', 'credit_hours': '3.00', 'minimum_credit_hours': '3.00', 'honors': None, 'section_type': 'DAY', 'credit_type': 'S', 'start_date': '2019-09-03', 'end_date': '2019-12-12', 'year_term': '20195', 'instructors': [{'person_id': '241223832', 'byu_id': '821566504', 'net_id': 'bretted', 'surname': 'Decker', 'sort_name': 'Decker, Brett E', 'rest_of_name': 'Brett E', 'preferred_first_name': 'Brett', 'phone_number': '801-380-4463', 'attribute_type': 'PRIMARY', 'year_term': '20195', 'curriculum_id': '01489', 'title_code': '002', 'section_number': '001', 'dept_name': 'C S', 'catalog_number': '142', 'catalog_suffix': None, 'fixed_or_variable': 'F', 'credit_hours': '3.00', 'minimum_credit_hours': '3.00', 'honors': None, 'credit_type': 'S', 'section_type': 'DAY'}], 'times': [{'begin_time': '0900', 'end_time': '0950', 'building': 'TMCB', 'room': '1170', 'sequence_number': '2', 'mon': 'M', 'tue': '', 'wed': 'W', 'thu': '', 'fri': 'F', 'sat': '', 'sun': ''}], 'headers': [], 'availability': {'seats_available': '51', 'class_size': '203', 'waitlist_size': '0'}}
--------------------------------------------------------------------------------
{'curriculum_id': '01489', 'title_code': '002', 'dept_name': 'C S', 'catalog_number': '142', 'catalog_suffix': None, 'section_number': '002', 'fixed_or_variable': 'F', 'credit_hours': '3.00', 'minimum_credit_hours': '3.00', 'honors': None, 'section_type': 'DAY', 'credit_type': 'S', 'start_date': '2019-09-03', 'end_date': '2019-12-12', 'year_term': '20195', 'instructors': [{'person_id': '241223832', 'byu_id': '821566504', 'net_id': 'bretted', 'surname': 'Decker', 'sort_name': 'Decker, Brett E', 'rest_of_name': 'Brett E', 'preferred_first_name': 'Brett', 'phone_number': '801-380-4463', 'attribute_type': 'PRIMARY', 'year_term': '20195', 'curriculum_id': '01489', 'title_code': '002', 'section_number': '002', 'dept_name': 'C S', 'catalog_number': '142', 'catalog_suffix': None, 'fixed_or_variable': 'F', 'credit_hours': '3.00', 'minimum_credit_hours': '3.00', 'honors': None, 'credit_type': 'S', 'section_type': 'DAY'}], 'times': [{'begin_time': '1000', 'end_time': '1050', 'building': 'TMCB', 'room': '1170', 'sequence_number': '2', 'mon': 'M', 'tue': '', 'wed': 'W', 'thu': '', 'fri': 'F', 'sat': '', 'sun': ''}], 'headers': [], 'availability': {'seats_available': '34', 'class_size': '203', 'waitlist_size': '0'}}
--------------------------------------------------------------------------------

...and so on.