Question

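I am trying to scrape this website with Python: https://ec.europa.eu/research/mariecurieactions/how-to/find-job_en

First, I noticed that the table I am interested in is actually served from the following URL: https://ec.europa.eu/assets/eac/msca/jobs/import-jobs_en.htm

However, requests + BS4 only gives me the HTML page source, without the table rows. I think this is because the content is loaded dynamically.

So I tried scraping the site with Selenium + BS4, but I still only get the page source.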

from selenium.webdriver import Firefox
from bs4 import BeautifulSoup
import lxml

driver = Firefox()
url = 'https://ec.europa.eu/assets/eac/msca/jobs/import-jobs_en.htm'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')

How can I scrape the website above?

Answer 1

If you dig a little further, you will find the real data here: https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml. Here is an example using SimplifiedDoc.

from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = req.get('https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml') 
doc = SimplifiedDoc(html)
jobs = doc.selects('job-opportunity')
for job in jobs:
    print(job.select('job-id>text()'), job.select('job-title>text()'))

Result:

367020 Early-Stage Researcher (ESR) 3-year PhD position - "Efficient intra-cavity and extra-cavity generation of beams with radial and azimuthal polarization in high-power thin-disk lasers" - Project: GREAT
377512 8 Short-term Early Stage Researcher positions available through the EvoCELL ITN (single cell genomics, evo-devo and science outreach)
383978 ESR (early stage researcher) for intelligent quality control cycles in Industry 4.0 process chains enabled by machine learning
......
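
For readers who prefer not to install an extra package, the same extraction can be sketched with the standard library's xml.etree.ElementTree. This is a minimal sketch, not the answer's method: the element names job-opportunity, job-id and job-title come from the answer above, while the sample document, its root element name and the helper parse_jobs are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mimics the element names used in the answer;
# the real feed's root element, nesting and extra fields are assumptions.
SAMPLE_XML = """<job-opportunities>
  <job-opportunity>
    <job-id>367020</job-id>
    <job-title>Early-Stage Researcher (ESR)</job-title>
  </job-opportunity>
  <job-opportunity>
    <job-id>377512</job-id>
    <job-title>8 Short-term Early Stage Researcher positions</job-title>
  </job-opportunity>
</job-opportunities>"""


def parse_jobs(xml_text):
    """Return a list of (job_id, job_title) tuples from the XML text."""
    root = ET.fromstring(xml_text)
    jobs = []
    # iter() walks the whole tree, so the nesting depth does not matter.
    for job in root.iter("job-opportunity"):
        job_id = job.findtext(".//job-id", default="").strip()
        title = job.findtext(".//job-title", default="").strip()
        jobs.append((job_id, title))
    return jobs


if __name__ == "__main__":
    for job_id, title in parse_jobs(SAMPLE_XML):
        print(job_id, title)
```

To try it on the real feed, pass the downloaded XML (text or bytes) to parse_jobs instead of SAMPLE_XML.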

Answer 2

Actually, you can get the desired result with requests + BS4. All you need to do is request the API https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml with headers.

Code

import requests
from bs4 import BeautifulSoup

headers = {
    'authority': 'euraxess.ec.europa.eu',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'accept': 'application/xml, text/xml, */*; q=0.01',
    'sec-ch-ua-mobile': '?0',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36',
    'origin': 'https://ec.europa.eu',
    'sec-fetch-site': 'same-site',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://ec.europa.eu/',
    'accept-language': 'en-US,en;q=0.9',
}

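response = requests.get('https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml', headers=headers)
# print(response.text)

# The tag names are lowercase, so html.parser can locate them in the XML.
soup = BeautifulSoup(response.content, 'html.parser')
job_ids = soup.find_all('job-id')
job_titles = soup.find_all('job-title')
# Pair each ID with the title at the same position.
for job_id, job_title in zip(job_ids, job_titles):
    print(job_id.text, job_title.text)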
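
One caveat with pairing two separate find_all lists through zip: the lists are matched purely by position, so if any record lacked a title, every later title would shift onto the wrong ID. Whether the real feed ever has such gaps is not known; the sketch below uses a hypothetical three-record sample and the standard library's ElementTree (not BS4) to show the difference between positional pairing and pairing inside each job-opportunity element.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: the second record has no <job-title>.
SAMPLE_XML = """<jobs>
  <job-opportunity><job-id>1</job-id><job-title>A</job-title></job-opportunity>
  <job-opportunity><job-id>2</job-id></job-opportunity>
  <job-opportunity><job-id>3</job-id><job-title>C</job-title></job-opportunity>
</jobs>"""

root = ET.fromstring(SAMPLE_XML)

# Pairing two flat lists by position: titles shift after the gap.
ids = [e.text for e in root.iter("job-id")]
titles = [e.text for e in root.iter("job-title")]
zipped = list(zip(ids, titles))

# Pairing inside each record keeps every title with its own ID.
per_record = [
    (job.findtext("job-id"), job.findtext("job-title", default=""))
    for job in root.iter("job-opportunity")
]

print(zipped)
print(per_record)
```

Here zipped pairs ID 2 with title C and drops ID 3, while per_record keeps all three IDs and leaves the missing title empty.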

Scraping a web page with dynamic JavaScript content using Python