Python: BeautifulSoup Scrape, Blank Descriptions for Courses

Time: 2018-11-26 23:51:50

Tags: python web-scraping beautifulsoup

I'm trying to scrape some course data from the site https://bulletins.psu.edu/university-course-descriptions/undergraduate/.

# -*- coding: utf-8 -*-
"""
Created on Mon Nov  5 20:37:33 2018

@author: DazedFury
"""
# Here, we're just importing both Beautiful Soup and the Requests library
from bs4 import BeautifulSoup
import requests

# returns a CloudflareScraper instance
#scraper = cfscrape.create_scraper()  

#URL and textfile
text_file = open("Output.txt", "w", encoding='UTF-8')
page_link = 'https://bulletins.psu.edu/university-course-descriptions/undergraduate/acctg/'
page_response = requests.get(page_link)
page_content = BeautifulSoup(page_response.content, "html.parser")

#Array for storing URL's
URLArray = []

#Find links
for link in page_content.find_all('a'):
    if('/university-course-descriptions/undergraduate' in link.get('href')):
        URLArray.append(link.get('href'))
k = 1

#Parse Loop        
while(k != 242):
    print("Writing " + str(k))

    completeURL = 'https://bulletins.psu.edu' + URLArray[k]  

    # this is the url that we've already determined is safe and legal to scrape from.
    page_link = completeURL

    # here, we fetch the content from the url, using the requests library
    page_response = requests.get(page_link)

    #we use the html parser to parse the url content and store it in a variable.
    page_content = BeautifulSoup(page_response.content, "html.parser")
    page_content.prettify    

    #Find and print all text with tag p
    paragraphs = page_content.find_all('div', {'class' : 'course_codetitle'})
    paragraphs2 = page_content.find_all('div', {'class' : 'courseblockdesc'})
    j = 0
    for i in range(len(paragraphs)):
        if i % 2 == 0:
            text_file.write(paragraphs[i].get_text())
            text_file.write("\n")
            if j < len(paragraphs2):
                text_file.write(" ".join(paragraphs2[j].get_text().split()))
                text_file.write("\n")
                text_file.write("\n")
                if(paragraphs2[j].get_text() != ""):
                    j += 1

    k += 1

#FORMAT
#text_file.write("<p style=\"page-break-after: always;\">&nbsp;</p>")
#text_file.write("\n\n")

#Close Text File
text_file.close()

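The specific information I need is the course title and description. The problem is that some courses have blank descriptions, which throws off the order and produces incorrect data.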

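I thought about simply checking whether the course description is blank, but on the site, if a course has no description, the "courseblockdesc" tag doesn't exist at all. So when I call find_all for courseblockdesc, nothing is added to the list for that course, and the titles and descriptions end up out of order. There are too many errors to fix by hand, so I'm hoping someone can help me find a solution.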
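
The misalignment can be reproduced offline with a small hand-written HTML fragment. The markup below is a hypothetical stand-in for the bulletin page (the class names come from the code above; the course entries are made up), so treat it as a sketch of the failure mode rather than the site's exact structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second course has no courseblockdesc div at all.
SAMPLE_HTML = """
<div class="courseblock">
  <div class="course_codetitle">DEMO 101: First Course</div>
  <div class="courseblockdesc">Description of the first course.</div>
</div>
<div class="courseblock">
  <div class="course_codetitle">DEMO 102: Second Course</div>
</div>
<div class="courseblock">
  <div class="course_codetitle">DEMO 103: Third Course</div>
  <div class="courseblockdesc">Description of the third course.</div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
titles = soup.find_all("div", {"class": "course_codetitle"})
descriptions = soup.find_all("div", {"class": "courseblockdesc"})

# Three titles but only two descriptions: pairing by index goes wrong
# starting at the first course that lacks a description.
print(len(titles), len(descriptions))

# zip() pairs DEMO 102 with the description that belongs to DEMO 103.
mismatched = [(t.get_text(), d.get_text()) for t, d in zip(titles, descriptions)]
for title, desc in mismatched:
    print(title, "->", desc)
```

Because the two flat lists have different lengths, no amount of index bookkeeping can recover which description belongs to which title.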

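2 Answers:

Answer 0 (score: 1)

The simplest solution is to loop over the parent of the items you're looking for with a single find_all, then look up the title and description inside each parent.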

for block in page_content.find_all('div', class_="courseblock"):
    title = block.find('div', {'class' : 'course_codetitle'})
    description = block.find('div', {'class' : 'courseblockdesc'})
    # do what you need with the navigable strings here.
    print(title.get_text())
    if description:
        print(description.get_text())
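
The same idea can be wrapped in a small function and run against an inline fragment, so it can be tried without hitting the site. The function name `extract_courses` and the sample markup are assumptions made for this sketch; only the class names come from the question.

```python
from bs4 import BeautifulSoup


def extract_courses(html):
    """Return (title, description) pairs; description is None when a block has none."""
    soup = BeautifulSoup(html, "html.parser")
    courses = []
    for block in soup.find_all("div", class_="courseblock"):
        title = block.find("div", class_="course_codetitle")
        if title is None:
            continue  # not a block with the expected layout; skip it
        description = block.find("div", class_="courseblockdesc")
        courses.append((
            title.get_text(strip=True),
            description.get_text(" ", strip=True) if description else None,
        ))
    return courses


# Hypothetical markup: the second block deliberately lacks a description div.
sample = """
<div class="courseblock">
  <div class="course_codetitle">DEMO 101: First Course</div>
  <div class="courseblockdesc">Has a description.</div>
</div>
<div class="courseblock">
  <div class="course_codetitle">DEMO 102: Second Course</div>
</div>
"""

pairs = extract_courses(sample)
for course_title, course_desc in pairs:
    print(course_title, "|", course_desc)
```

Since each lookup is scoped to one courseblock, a missing description only affects its own course and cannot shift the ones after it.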

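Answer 1 (score: 1)

You may be overcomplicating the process a bit, but you're definitely on the right track. Instead of storing the information in arrays and relying on all of the indices lining up, write the text file as you iterate over the courses, pulling the title and description dynamically from each course block. If a block has no description, you can handle that on the spot. Here's a working example: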

from bs4 import BeautifulSoup
import requests

url = "https://bulletins.psu.edu/university-course-descriptions/undergraduate/acctg/"

with open("out.txt", "w", encoding="UTF-8") as f:
    for link in BeautifulSoup(requests.get(url).content, "html.parser").find_all("a"):
        if "/university-course-descriptions/undergraduate" in link["href"]:
            soup = BeautifulSoup(requests.get("https://bulletins.psu.edu" + link["href"]).content, "html.parser")

            for course in soup.find_all("div", {"class": "courseblock"}):
                title = course.find("div", {"class" : "course_codetitle"}).get_text().strip()

                try:
                    desc = course.find("div", {"class" : "courseblockdesc"}).get_text().strip()
                except AttributeError:
                    desc = "No description available"

                f.write(title + "\n" + desc + "\n\n")

Output snippet (from the end of the text file, to verify alignment):

WLED 495: **SPECIAL TOPICS**
No description available

WLED 495B: Field Experience for World Languages Teacher Preparation in Grades 1-5
WL ED 495B Field Experience for World Languages Teacher Preparation in Grades 1-5 (3) Practicum situation where Prospective World Language teachers will demonstrate acquired knowledge on second language learning/teaching and educational theories. Prospective World Language teachers will have assigned school placements and will attend a weekly seminar where issues in World Language learning and teaching will be discussed. At their assigned school placement, prospective World Language teachers will have many opportunities to observe/work with children in grades 1-5 (1) focusing on second language learning/teaching and the socio/cultural issues associated to classroom practices while implementing and self-evaluated own designed activities and lessons; (2) weekly seminars will engage students in reflective activities that will enable them to analyze each week's events; (3) inquiry projects on teaching and learning of World Languages.

WLED 495C: Field Experience for World Languages Teacher Preparation in Grades 6-12
WL ED 495C Field Experience for World Languages Teacher Preparation in Grades 6-12 (3) Practicum situation where prospective World Language teachers will demonstrate acquired knowledge on second language learning/teaching and educational theories. Prospective World Language teachers will have assigned school placements in grades 6-12 and will attend a weekly seminar where issues in World Language learning and teaching will be discussed. At their assigned school placement, prospective World Language teachers will have many opportunities to observe/work with students in grades 6-12 (1) focusing on second language learning/teaching and the socio/cultural issues associated to classroom practices while implementing and self-evaluating their own designed activities and lessons, (2) weekly seminars will engage students in reflective activities that will enable them to analyze each week's events, and (3) inquiry projects on teaching and learning of World Languages.
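
If catching `AttributeError` feels too broad, the missing description can also be checked explicitly. The sketch below assumes a hypothetical helper, `write_courses`, that takes already-downloaded HTML and any writable text stream, which keeps it testable without network access; the sample markup is invented.

```python
import io

from bs4 import BeautifulSoup

PLACEHOLDER = "No description available"


def write_courses(html, out):
    """Write each course title and its description (or a placeholder) to `out`.

    Returns the number of courses written.
    """
    soup = BeautifulSoup(html, "html.parser")
    count = 0
    for course in soup.find_all("div", {"class": "courseblock"}):
        title_tag = course.find("div", {"class": "course_codetitle"})
        if title_tag is None:
            continue  # skip blocks that do not look like a course
        desc_tag = course.find("div", {"class": "courseblockdesc"})
        # Collapse runs of whitespace, as the question's code does.
        desc = " ".join(desc_tag.get_text().split()) if desc_tag else ""
        out.write(title_tag.get_text().strip() + "\n" + (desc or PLACEHOLDER) + "\n\n")
        count += 1
    return count


# Hypothetical markup: one course with a description, one without.
sample = """
<div class="courseblock">
  <div class="course_codetitle">DEMO 201: Documented Course</div>
  <div class="courseblockdesc">
    Covers   several
    topics.
  </div>
</div>
<div class="courseblock">
  <div class="course_codetitle">DEMO 202: Undocumented Course</div>
</div>
"""

buffer = io.StringIO()
written = write_courses(sample, buffer)
print(buffer.getvalue())
```

Passing the stream in as a parameter means the same function works with a real file opened via `with open(...)` or with an in-memory buffer for quick checks.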

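Other minor remarks:

  • It's a good idea to use the with keyword for file I/O. This automatically closes the file handle when you're done.

  • Verbose intermediate variables and comments that add noise, such as: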

# Here, we're just importing both Beautiful Soup and the Requests library
from bs4 import BeautifulSoup

#Close Text File
text_file.close()

can always be removed, making the program logic easier to follow.
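
On the first remark, a quick way to see what `with` buys you is to check the handle's `closed` attribute after the block exits. This sketch writes to a temporary directory so it leaves nothing behind; the file name and contents are arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "out.txt")

    with open(path, "w", encoding="UTF-8") as f:
        f.write("DEMO 101: First Course\n")
        open_inside = not f.closed  # the handle is still open inside the block

    # Leaving the block closed the handle; no explicit f.close() was needed.
    closed_after = f.closed

    with open(path, encoding="UTF-8") as f:
        contents = f.read()

print(open_inside, closed_after, repr(contents))
```

The handle is also closed if the body raises an exception, which a trailing `text_file.close()` at the end of a script does not guarantee.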