Question

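I'm trying to create a script in Python using the requests module to scrape the titles of different jobs from a website. To parse the titles, I first need to get the relevant response from the site so that I can process the content with BeautifulSoup. However, when I run the following script, it produces gibberish that does not contain the titles I'm looking for.

website link (in case you don't see any data, make sure to refresh the page)

I've tried: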

import requests
from bs4 import BeautifulSoup

link = 'https://www.alljobs.co.il/SearchResultsGuest.aspx?'

query_string = {
    'page': '1',
    'position': '235',
    'type': '',
    'city': '',
    'region': ''
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'
    s.headers.update({"Referer":"https://www.alljobs.co.il/SearchResultsGuest.aspx?page=2&position=235&type=&city=&region="})
    res = s.get(link,params=query_string)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select(".job-content-top [class^='job-content-top-title'] a[title]"):
        print(item.text)

I even tried it like this:

import urllib.request
from bs4 import BeautifulSoup
from urllib.parse import urlencode

link = 'https://www.alljobs.co.il/SearchResultsGuest.aspx?'

query_string = {
    'page': '1',
    'position': '235',
    'type': '',
    'city': '',
    'region': ''
}

headers={
    "User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36",
    "Referer":"https://www.alljobs.co.il/SearchResultsGuest.aspx?page=2&position=235&type=&city=&region="  
}

def get_content(url,params):
    req = urllib.request.Request(f"{url}{params}",headers=headers)
    res = urllib.request.urlopen(req).read()
    soup = BeautifulSoup(res,"lxml")
    for item in soup.select(".job-content-top [class^='job-content-top-title'] a[title]"):
        yield item.text

if __name__ == '__main__':
    params = urlencode(query_string)
    for item in get_content(link,params):
        print(item)

How can I fetch the titles of the different jobs using requests?

PS: A browser simulator is not an option for this task.

Answer 1

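I'd like to see the gibberish you're getting. When I run your code, I get a bunch of Hebrew characters (unsurprising, since the site is in Hebrew) and job titles: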

לחברתהייטקמובילה，IT项目经理 םAllStars-IT集团（MT）אלתמערכותמגייסתמפתח/תJAVAלגוףרפואיגדולהיושבבתלאביב！ דרושיםאלעדמערכות נתח/תמערכותומאפיין/ת דרושיםמרטנסהופמןשירותימחשוב אנשי/נשותתפעולותמיכהטכניתלמוצראינטרנטי דרושיםהמימדהשלישי DBA SQL / ORACLE םרושיםCPS职位 דרושים/ותאנשי/נשותתמיכהעלמערכתפריוריטי，שכרמתגמללמתאימים/ות דרושיםחברהוןאנושי פתח/תSAP ABAP דרושיםטאוארסמיקונדקטור 数据分析总监 דרושיםאופיסופט 全栈开发人员 םSQLink פתח/תתיותדאטהותומךתשתיתBI שרושיםהמימדהשביעיבע"מ פתח/תתיותדאטהותומך/תתשתיתBI םרושיםיוניטסק 阿拉伯联合王国/阿拉伯联合酋长国/ ABAP םרושיםיוניטסק / / / / / / /תתקקק 塔尔多（Taldor）的照片 שרוש/המפתח/תאינטגרציה םSQLink שרוש/הראשצוות全栈 תכנת/ת 高级软件工程师经理高级软件工程师资深嵌入式软件工程师嵌入式软件工程师高级软件工程师子公司PMM经理 תןוכניתן/ית后端全栈/前端软件工程师软件验证工程师首席产品经理量子算法研究实习生校长/高级检测组组长支援工程师软件工程师

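Do you want to filter out the Hebrew characters? That only takes a simple regular expression. Import the re package, then replace your print statement with the following:

print(re.sub('[^A-Za-z0-9]+', ' ', item.text))

Hope this helps!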
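As a small self-contained illustration of that filter, here is a sketch run on a made-up sample string (the helper name `keep_ascii_alnum` and the sample text are illustrative assumptions, not taken from the site). It also shows why the character class is best spelled `A-Za-z` rather than `A-z`: the range `A-z` additionally covers the ASCII punctuation between `Z` and `a`, such as `[`, `]` and `_`, so those characters would survive the filter.

```python
import re


def keep_ascii_alnum(text):
    """Replace every run of characters that are not ASCII letters or digits
    with a single space, then trim the ends."""
    return re.sub(r'[^A-Za-z0-9]+', ' ', text).strip()


def keep_ascii_loose(text):
    """Same idea with the looser A-z range, which also keeps [ \\ ] ^ _ and `."""
    return re.sub(r'[^A-z0-9]+', ' ', text).strip()


# Hypothetical sample mixing Hebrew, Latin letters and punctuation.
sample = "דרושים DBA SQL / ORACLE [senior]_dev"

print(keep_ascii_alnum(sample))  # DBA SQL ORACLE senior dev
print(keep_ascii_loose(sample))  # DBA SQL ORACLE [senior]_dev
```

Note that this drops the Hebrew part of every title, so it only helps if the Latin-script portion is all you need.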

Answer 2

To get the expected response, you have to use cookies. For this URL, the rbzid cookie is enough. You can obtain it manually; if it expires, you can implement a solution using Selenium and a proxy server to refresh it and continue scraping with the rbzid cookie.
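A minimal sketch of how a manually obtained rbzid cookie could be attached to a requests session. The cookie value is a placeholder you would copy from your own browser, and the cookie domain, the helper names `make_session` and `fetch_page`, and the timeout are assumptions made for illustration; whether the site accepts the request still depends on the cookie being valid.

```python
import requests

# URL, query parameters and User-Agent are taken from the question.
LINK = "https://www.alljobs.co.il/SearchResultsGuest.aspx"
QUERY = {"page": "1", "position": "235", "type": "", "city": "", "region": ""}
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/80.0.3987.122 Safari/537.36"
)


def make_session(rbzid_value, domain="www.alljobs.co.il"):
    """Build a session that sends the rbzid cookie with every request.

    The domain default is an assumption; adjust it if the browser shows
    the cookie stored under a different domain.
    """
    session = requests.Session()
    session.headers["User-Agent"] = USER_AGENT
    session.cookies.set("rbzid", rbzid_value, domain=domain)
    return session


def fetch_page(session, page=1):
    """Request one results page and return its HTML text."""
    params = {**QUERY, "page": str(page)}
    res = session.get(LINK, params=params, timeout=30)
    res.raise_for_status()
    return res.text


# Usage (performs a network request, so it is left commented out):
# session = make_session("PASTE-YOUR-RBZID-VALUE-HERE")
# html = fetch_page(session, page=1)
```

The returned HTML can then be passed to BeautifulSoup with the same selector used in the question.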
