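I am trying to scrape this website with Python: "https://ec.europa.eu/research/mariecurieactions/how-to/find-job_en".
First, I noticed that the table I am interested in actually lives at the following URL: https://ec.europa.eu/assets/eac/msca/jobs/import-jobs_en.htm
However, requests + BS4 only gives me the HTML page source. I think this is because the content is loaded dynamically.
So I tried scraping the site with Selenium + BS4, but I still only get the page source.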
from selenium.webdriver import Firefox
from bs4 import BeautifulSoup
import lxml
driver = Firefox()
url = 'https://ec.europa.eu/assets/eac/msca/jobs/import-jobs_en.htm'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')
How can I scrape the website above?
Answer 0 (score: 0)
If you dig a little further, you will find the real data here: https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml Here is an example using SimplifiedDoc.
from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = req.get('https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml')
doc = SimplifiedDoc(html)
jobs = doc.selects('job-opportunity')
for job in jobs:
    print(job.select('job-id>text()'), job.select('job-title>text()'))
Result:
367020 Early-Stage Researcher (ESR) 3-year PhD position - "Efficient intra-cavity and extra-cavity generation of beams with radial and azimuthal polarization in high-power thin-disk lasers" - Project: GREAT
377512 8 Short-term Early Stage Researcher positions available through the EvoCELL ITN (single cell genomics, evo-devo and science outreach)
383978 ESR (early stage researcher) for intelligent quality control cycles in Industry 4.0 process chains enabled by machine learning
......
Answer 1 (score: 0)
In fact, requests + BS4 is enough to get the result you want. All you need to do is request the API https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml with headers.
Code
import requests
from bs4 import BeautifulSoup
headers = {
'authority': 'euraxess.ec.europa.eu',
'pragma': 'no-cache',
'cache-control': 'no-cache',
'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
'accept': 'application/xml, text/xml, */*; q=0.01',
'sec-ch-ua-mobile': '?0',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36',
'origin': 'https://ec.europa.eu',
'sec-fetch-site': 'same-site',
'sec-fetch-mode': 'cors',
'sec-fetch-dest': 'empty',
'referer': 'https://ec.europa.eu/',
'accept-language': 'en-US,en;q=0.9',
}
response = requests.get('https://euraxess.ec.europa.eu/sites/default/files/exports/msca.xml',headers=headers)
# print(response.text)
soup = BeautifulSoup(response.content, 'html.parser')
ID = soup.find_all('job-id')
Title = soup.find_all('job-title')
# Use separate loop names so the ID and Title lists are not overwritten
for job_id, title in zip(ID, Title):
    print(job_id.text, title.text)
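One caveat with pairing two find_all lists by position: if any record lacks a job-title, every later ID ends up paired with the wrong title. The sketch below shows one possible way around that by walking each job-opportunity element and reading both fields from it. It uses the standard library's xml.etree.ElementTree instead of BS4. The sample XML, its root element name, and the helper names pair_by_record and pair_by_position are hypothetical and only for illustration. The real feed's layout may differ, so treat this as a starting point rather than a tested solution for the live file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the second record deliberately has no <job-title>.
# Only the tag names job-opportunity, job-id and job-title come from the answers above.
SAMPLE_XML = """<job-opportunities>
  <job-opportunity>
    <job-id>1</job-id>
    <job-title>First title</job-title>
  </job-opportunity>
  <job-opportunity>
    <job-id>2</job-id>
  </job-opportunity>
  <job-opportunity>
    <job-id>3</job-id>
    <job-title>Third title</job-title>
  </job-opportunity>
</job-opportunities>"""


def pair_by_record(xml_text):
    """Read the ID and title from inside each job-opportunity element."""
    root = ET.fromstring(xml_text)
    rows = []
    for job in root.iter('job-opportunity'):
        # findtext returns None when the tag is missing, so fall back to ''
        job_id = (job.findtext('.//job-id') or '').strip()
        title = (job.findtext('.//job-title') or '').strip()
        rows.append((job_id, title))
    return rows


def pair_by_position(xml_text):
    """Pair two flat lists with zip, as in the answer above, for comparison."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.iter('job-id')]
    titles = [e.text for e in root.iter('job-title')]
    return list(zip(ids, titles))


for job_id, title in pair_by_record(SAMPLE_XML):
    print(job_id, title)
```

To apply this to the real feed, pass response.content from the requests call above to pair_by_record in place of SAMPLE_XML.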