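Selenium web scraping: how to get specific tags using Selenium?

Time: 2020-10-11 16:55:01

Tags: python selenium web-scraping beautifulsoup

I am scraping different courses from a university website.

The HTML for part of the site is: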

<div>
<h2>About the programme</h2>
<p>The National&nbsp;Joint&nbsp;PhD Programme in Nautical Operations&nbsp;is organised as a joint degree between the following four national higher education institutions offering professional maritime education:</p>
<ul>
    <li>Universtity of Troms&oslash; - The Arctic University of Norway (UiT)</li>
    <li>University of&nbsp;South-Eastern&nbsp;Norway (USN)</li>
    <li>Western Norway University of Applied Sciences (HVL)</li>
    <li>Norwegian University of Science and Technology (NTNU)</li>
</ul>
<p>
    The National&nbsp;Joint&nbsp;PhD Programme in Nautical Operations will educate qualified candidates for research, teaching, dissemination and innovation work, and other activities requiring scientific insight and an operational
    maritime focus.&nbsp;
</p>
<p>
    Implementation of complex nautical operations today requires interdisciplinarity and differentiated competence, including research expertise, for the safe and efficient planning, implementation and evaluation of nautical
    operations.&nbsp;
</p>
<p>The programme has the following&nbsp;vision: to create an internationally recognized national PhD degree in nautical operations.</p>
<p>This vision will be achieved through the following overall objectives:</p>
<ol>
    <li>Strengthen the multidisciplinary national expertise in nautical operations through collaboration between the four higher education institutions in Norway with professional maritime education.</li>
    <li>The PhD Programme in Nautical Operations is the preferred Programme in the field and attracts good applicants nationally and internationally from major maritime nations.</li>
    <li>Individuals graduating from the Programme are in demand both nationally and internationally because they have a strong and relevant research-based expertise and the ability to innovate and adapt.</li>
    <li>Increase value creation and innovation through close cooperation between academia, maritime industry and public sector.</li>
    <li>The multidisciplinary national competence related to nautical operations constitutes an internationally recognised professional environment that sets the terms for the development of knowledge in the field.</li>
</ol>
<h2>Academic content</h2>
<p>Nautical operations consist of two subject areas:</p>
<ul>
    <li>
        Nautical studies&nbsp;that include navigation, maneuvering and transport of floating craft, and operations, indicating that the PhD program will focus on applied research to support, improve and develop the activities
        undertaken.
    </li>
    <li>
        The operational perspective&nbsp;includes strategic, tactical and operational aspects.&nbsp;Strategic levels include the choice of type and size of a ship fleet.&nbsp;Tactical aspects concern the design of individual ships and
        the selection of equipment and staff.&nbsp;The operational aspects include planning, implementation and evaluation of nautical operations.
    </li>
</ul>
<p>There is a compulsory&nbsp;joint maritime course offered at all the four institutions.</p>

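Link to the website: https://www.usn.no/english/research/postgraduate-studies-phd/our-phd-programmes/nautical-operations/

I am trying to get the text for course_description / about_the_course and academic_content, which correspond to the "h2" tags above. I am completely clueless about how to write generic code that scrapes the text based on the h2 tag.

Also, I don't think indexing will help, because the order of the <p> and <li> tags varies from course to course.

3 answers:

Answer 0: (score: 2)

It is actually very simple. Just identify the div tag and print the text inside it. Here is the complete code to do this: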

from bs4 import BeautifulSoup
import requests

r = requests.get('https://www.usn.no/english/research/postgraduate-studies-phd/our-phd-programmes/nautical-operations/').text

soup = BeautifulSoup(r,'html5lib')

div_tag = soup.find('div',class_ = "articleelement newtext contentAbove")

print(div_tag.text)

Output:

About the programme
The National Joint PhD Programme in Nautical Operations is organised as a joint degree between the following four national higher education institutions offering professional maritime education:
    Universtity of Tromsø - The Arctic University of Norway (UiT)
    University of South-Eastern Norway (USN)
    Western Norway University of Applied Sciences (HVL)
    Norwegian University of Science and Technology (NTNU)
The National Joint PhD Programme in Nautical Operations will educate qualified candidates for research, teaching, dissemination and innovation work, and other activities requiring scientific insight and an operational maritime focus. 
Implementation of complex nautical operations today requires interdisciplinarity and differentiated competence, including research expertise, for the safe and efficient planning, implementation and evaluation of nautical operations. 
The programme has the following vision: to create an internationally recognized national PhD degree in nautical operations.
This vision will be achieved through the following overall objectives:
    Strengthen the multidisciplinary national expertise in nautical operations through collaboration between the four higher education institutions in Norway with professional maritime education.
    The PhD Programme in Nautical Operations is the preferred Programme in the field and attracts good applicants nationally and internationally from major maritime nations.
    Individuals graduating from the Programme are in demand both nationally and internationally because they have a strong and relevant research-based expertise and the ability to innovate and adapt.
    Increase value creation and innovation through close cooperation between academia, maritime industry and public sector.
    The multidisciplinary national competence related to nautical operations constitutes an internationally recognised professional environment that sets the terms for the development of knowledge in the field.
Academic content
Nautical operations consist of two subject areas:
    Nautical studies that include navigation, maneuvering and transport of floating craft, and operations, indicating that the PhD program will focus on applied research to support, improve and develop the activities undertaken.
    The operational perspective includes strategic, tactical and operational aspects. Strategic levels include the choice of type and size of a ship fleet. Tactical aspects concern the design of individual ships and the selection of equipment and staff. The operational aspects include planning, implementation and evaluation of nautical operations.
There is a compulsory joint maritime course offered at all the four institutions.

That gets the full text. If you only want the headings, here is the complete code:

from bs4 import BeautifulSoup
import requests

r = requests.get('https://www.usn.no/english/research/postgraduate-studies-phd/our-phd-programmes/nautical-operations/').text

soup = BeautifulSoup(r,'html5lib')

div_tag = soup.find('div',class_ = "articleelement newtext contentAbove")

headings = div_tag.find_all('h2')

for heading in headings:
    print(heading.text)

Output:

About the programme
Academic content

Hope this helps!
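Building on the heading loop above, the following is a minimal sketch of one way to group the text under each h2, which is what the question asks for. It walks the siblings after each heading and stops at the next h2, so the order of p, ul and ol tags does not matter. The sample HTML is a shortened, hypothetical stand-in for the page, the function name sections_by_heading is made up for illustration, and it assumes all headings and their content are direct siblings inside the same div.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the page's content div.
SAMPLE_HTML = """
<div class="articleelement newtext contentAbove">
<h2>About the programme</h2>
<p>The programme is organised as a joint degree.</p>
<ul>
    <li>University of South-Eastern Norway (USN)</li>
    <li>Norwegian University of Science and Technology (NTNU)</li>
</ul>
<h2>Academic content</h2>
<p>Nautical operations consist of two subject areas:</p>
</div>
"""


def sections_by_heading(html, container_class="articleelement newtext contentAbove"):
    """Map each h2 heading to the text of the sibling tags that follow it."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_=container_class)
    sections = {}
    if container is None:
        return sections
    for heading in container.find_all("h2"):
        parts = []
        # True matches any tag; walk the siblings and stop at the next h2.
        for sibling in heading.find_next_siblings(True):
            if sibling.name == "h2":
                break
            # Emit one line per <li> so list items stay separate.
            items = sibling.find_all("li") or [sibling]
            parts.extend(item.get_text(" ", strip=True) for item in items)
        sections[heading.get_text(strip=True)] = "\n".join(parts)
    return sections


if __name__ == "__main__":
    for title, text in sections_by_heading(SAMPLE_HTML).items():
        print(f"== {title} ==\n{text}\n")
```

To try it on the live page, pass the response text from requests.get(...) instead of SAMPLE_HTML. If a course page nests its headings differently, the sibling walk would need adjusting.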

Answer 1: (score: 2)

You can use .get_text() with separator='\n':

import requests
from bs4 import BeautifulSoup


url = 'https://www.usn.no/english/research/postgraduate-studies-phd/our-phd-programmes/nautical-operations/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

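# `string=` is the current name for the older `text=` argument; the `t and` guard skips h2 tags whose .string is None
desc = soup.find('h2', string=lambda t: t and 'About the programme' in t)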
print( desc.parent.get_text(strip=True, separator='\n') )

Prints:

About the programme
The National Joint PhD Programme in Nautical Operations is organised as a joint degree between the following four national higher education institutions offering professional maritime education:
Universtity of Tromsø
- The Arctic University of Norway (UiT)
University of South-Eastern Norway (USN)
Western Norway University of Applied Sciences
(HVL)
Norwegian University of Science and Technology
(NTNU)
The National Joint PhD Programme in Nautical Operations will educate qualified candidates for research, teaching, dissemination and innovation work, and other activities requiring scientific insight and an operational maritime focus.
Implementation of complex nautical operations today requires interdisciplinarity and differentiated competence, including research expertise, for the safe and efficient planning, implementation and evaluation of nautical operations.
The programme has the following vision: to create an internationally recognized national PhD degree in nautical operations.
This vision will be achieved through the following overall objectives:
Strengthen the multidisciplinary national expertise in nautical operations through collaboration between the four higher education institutions in Norway with professional maritime education.
The PhD Programme in Nautical Operations is the preferred Programme in the field and attracts good applicants nationally and internationally from major maritime nations.
Individuals graduating from the Programme are in demand both nationally and internationally because they have a strong and relevant research-based expertise and the ability to innovate and adapt.
Increase value creation and innovation through close cooperation between academia, maritime industry and public sector.
The multidisciplinary national competence related to nautical operations constitutes an internationally recognised professional environment that sets the terms for the development of knowledge in the field.
Academic content
Nautical operations consist of two subject areas:
Nautical studies that include navigation, maneuvering and transport of floating craft, and operations, indicating that the PhD program will focus on applied research to support, improve and develop the activities undertaken.
The operational perspective includes strategic, tactical and operational aspects. Strategic levels include the choice of type and size of a ship fleet. Tactical aspects concern the design of individual ships and the selection of equipment and staff. The operational aspects include planning, implementation and evaluation of nautical operations.
There is a compulsory joint maritime course offered at all the four institutions.
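Some list items in the output above are split across two lines, for example "(HVL)" and "(NTNU)". That is presumably because the live page wraps part of each item in an inline tag, and get_text() inserts the separator between every text node. A small sketch of the effect, using hypothetical markup rather than the page's real HTML:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the inline <a> splits the list item into two text nodes.
html = '<ul><li><a href="#">Western Norway University of Applied Sciences</a> (HVL)</li></ul>'
li = BeautifulSoup(html, "html.parser").li

# A newline separator is inserted between every text node, even inside one <li>.
split_text = li.get_text(separator="\n", strip=True)

# A space separator keeps the item on a single line.
joined_text = li.get_text(separator=" ", strip=True)

print(repr(split_text))
print(repr(joined_text))
```

Calling get_text(" ", strip=True) on each p or li separately and joining the results with newlines is one way to keep one line per element.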

Answer 2: (score: 1)

You can try using Selenium:

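# Selenium 4 style; on Selenium 3 use webdriver.Chrome(PATH) and driver.find_element_by_xpath(path)
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

PATH = "./chromedriver"

driver = webdriver.Chrome(service=Service(PATH))
driver.implicitly_wait(5)

url = "https://www.usn.no/english/research/postgraduate-studies-phd/our-phd-programmes/nautical-operations/"
driver.get(url)

# First <p> sibling after the "About the programme" heading
path = "//div[@class='articleelement newtext contentAbove']//h2[contains(text(), 'About the programme')]/following-sibling::p"
about_the_program = driver.find_element(By.XPATH, path)

# First <p> sibling after the "Academic content" heading
path = "//div[@class='articleelement newtext contentAbove']//h2[contains(text(), 'Academic content')]/following-sibling::p"
academic_content = driver.find_element(By.XPATH, path)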

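Here you find the h2 tag with the text About the programme and/or Academic content. Then you select the following sibling with the p tag for that h2 tag. If you want the sibling to be a different tag, you can specify that tag in the path.

Edit 1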

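If you don't know which tag follows the h2 tag, you can try this:

from selenium.webdriver.common.by import By

list_of_tags = ['p', 'ul', 'span']
for tag in list_of_tags:
    path = "//div[@class='articleelement newtext contentAbove']//h2[contains(text(), 'About the programme')]/following-sibling::"
    try:
        # Append the candidate tag name to the sibling axis
        path = path + tag
        element_required = driver.find_element(By.XPATH, path)
    except Exception as e:
        print(e)

This code updates the path variable with each tag in the list. If a matching tag is found, the code fetches it; otherwise it prints the error.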
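The tag-by-tag loop can also be replaced by a single XPath that selects every element between the chosen h2 and the next h2, whatever its tag. The sketch below assumes the headings and their content are siblings inside the same div. The helper names section_xpath and section_text are hypothetical, and heading text containing an apostrophe would need extra quoting.

```python
def section_xpath(heading, container_class="articleelement newtext contentAbove"):
    """Build an XPath for every element between the given h2 and the next h2."""
    is_target = f"contains(text(), '{heading}')"
    return (
        f"//div[@class='{container_class}']//h2[{is_target}]"
        # Keep non-h2 siblings whose nearest preceding h2 is the target one.
        f"/following-sibling::*[not(self::h2)]"
        f"[preceding-sibling::h2[1][{is_target}]]"
    )


def section_text(driver, heading):
    """Join the visible text of all elements in the section under `heading`."""
    # "xpath" is the value of Selenium's By.XPATH constant.
    elements = driver.find_elements("xpath", section_xpath(heading))
    return "\n".join(element.text for element in elements)
```

With the driver from this answer already on the page, section_text(driver, 'About the programme') and section_text(driver, 'Academic content') would return the two blocks of text separately.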