Unable to modify existing logic to parse titles from the next pages

Time: 2019-07-27 13:23:49

Tags: python python-3.x web-scraping

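I've written a Python script that uses the requests module to fetch the titles of the results that duckduckgo.com returns for a search. My search keyword is cricket. The script parses the titles from the first page perfectly.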

Website address

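However, I'm having trouble parsing the titles from the next pages, because two fields of the params, 's': '0' and 'dc': '-27', increase in an irregular way. The rest of the fields are static.

To parse the titles from the first page, I tried the following (it works):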

import requests
from bs4 import BeautifulSoup

URL = "https://duckduckgo.com/html/"

params = {
    'q': 'python',
    's': '0',
    'nextParams': '',
    'v': 'l',
    'o': 'json',
    'dc': '-27',
    'api': 'd.js',
    'kl': 'us-en'
}

resp = requests.post(URL,data=params,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(resp.text,"lxml")
for title in soup.select(".result__body .result__a"):
    print(title.text)

The two fields of the params increase as follows:

First page:

's': '0'
'dc': '-27'

Second page:

's': '30'
'dc': '27'

Third page:

's': '80'
'dc': '76'

Fourth page:

's': '130'
'dc': '126'
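To make the irregularity concrete, the short sketch below computes the step between consecutive pages for both fields, using only the four values listed above. The helper name `steps` is made up for illustration.

```python
# Values copied from the four pages listed above.
s_values = [0, 30, 80, 130]
dc_values = [-27, 27, 76, 126]


def steps(values):
    """Return the difference between each consecutive pair of values."""
    return [b - a for a, b in zip(values, values[1:])]


s_steps = steps(s_values)
dc_steps = steps(dc_values)

print(s_steps)   # [30, 50, 50]
print(dc_steps)  # [54, 49, 50]
```

Neither field advances by a constant amount over these four pages, so hard-coding an increment is unlikely to be reliable.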

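How can I scrape the titles from the next pages as well?

1 Answer:

Answer 0 (score: 1)

The parameters for the next page are included in every POST response as hidden form fields, so they can be read from the response instead of being computed.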
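As a rough illustration of that idea, the sketch below pulls the hidden fields out of a small, made-up HTML fragment. The markup in `SAMPLE_HTML` and the helper name `next_page_params` are assumptions for the example; the real page may be structured differently. It uses the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real response may differ.
SAMPLE_HTML = """
<form class="header__form" action="/html/" method="post">
  <input type="hidden" name="kl" value="us-en">
</form>
<form action="/html/" method="post">
  <input type="submit" value="Next">
  <input type="hidden" name="q" value="python">
  <input type="hidden" name="s" value="30">
  <input type="hidden" name="dc" value="27">
</form>
"""


def next_page_params(html):
    """Collect name/value pairs of hidden inputs outside the header form."""
    soup = BeautifulSoup(html, "html.parser")
    fields = soup.select("form:not(.header__form) [type=hidden]")
    return {field["name"]: field["value"] for field in fields}


next_params = next_page_params(SAMPLE_HTML)
print(next_params)  # {'q': 'python', 's': '30', 'dc': '27'}
```

The hidden input inside the header form is skipped by the `:not(.header__form)` part of the selector, so only the fields from the other form end up in the result.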

import requests
from bs4 import BeautifulSoup

URL = "https://duckduckgo.com/html/"

params = {
    'q': 'python',
    's': '0',
    'nextParams': '',
    'v': 'l',
    'o': 'json',
    'dc': '0',
    'api': 'd.js',
    'kl': 'us-en'
}

with requests.Session() as s:
    while True:
        resp = s.post(URL, data=params, headers={"User-Agent": "Mozilla/5.0"})
        soup = BeautifulSoup(resp.text, "lxml")
        # Print the titles found on the current page
        for title in soup.select(".result__body .result__a"):
            print(title.text)
        # Update params from the hidden form fields in the response
        for i in soup.select('form:not(.header__form) [type=hidden]'):
            params[i['name']] = i['value']
        # Stop when the page no longer offers a "Next" button
        if not soup.select_one('[value=Next]'):
            break