How to scrape multiple Google pages with Python and BeautifulSoup

Posted: 2019-11-20 14:03:33

Tags: python beautifulsoup

I wrote code that scrapes Google News search results, but it only ever scrapes the first page. How can I write a loop that lets me scrape the first 2, 3, ... n pages?

I know that I need to add a parameter for the page and put all the parameters in the url, but I don't know how.

This code gives me the headlines, snippets, and dates from the first search results page:

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

term = 'usa'
url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(term)  # i know that I need to add this parameter for page, but I do not know how

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

headline_text = soup.find_all('h3', class_= "r dO0Ag")
snippet_text = soup.find_all('div', class_='st')
news_date = soup.find_all('div', class_='slp')

Also, can this same page logic be applied to, for example, Bing News? I mean, can I use the same parameter, or is it different there?
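The page parameter the question is asking about can be sketched as a small URL builder. This is a minimal sketch assuming Google's `start` query parameter, which offsets results in steps of 10 (`start=0` is page 1, `start=10` is page 2, and so on); `build_news_url` is a hypothetical helper name introduced here for illustration:

```python
def build_news_url(term, page):
    """Return a Google News search URL for a zero-based page index.

    Assumes Google's `start` parameter, which offsets results in
    steps of 10 (start=0 -> page 1, start=10 -> page 2, ...).
    """
    return ('https://www.google.com/search'
            '?q={0}&source=lnms&tbm=nws&start={1}'.format(term, page * 10))
```

Calling `build_news_url('usa', 2)` would then produce the third page's URL, ending in `start=20`.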

1 Answer:

Answer 0 (score: 2)

I think you need to change the URL. Try the code below and see if it works.

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

term = 'usa'
page=0


while True:
    # `start` advances in steps of 10: 0 is page 1, 10 is page 2, and so on
    url = 'https://www.google.com/search?q={}&tbm=nws&sxsrf=ACYBGNTx2Ew_5d5HsCvjwDoo5SC4U6JBVg:1574261023484&ei=H1HVXf-fHfiU1fAP65K6uAU&start={}&sa=N&ved=0ahUKEwi_q9qog_nlAhV4ShUIHWuJDlcQ8tMDCF8&biw=1280&bih=561&dpr=1.5'.format(term, page)
    print(url)

    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        break
    soup = BeautifulSoup(response.text, 'html.parser')

    headline_text = soup.find_all('h3', class_='r dO0Ag')

    snippet_text = soup.find_all('div', class_='st')

    news_date = soup.find_all('div', class_='slp')

    # Google also answers empty result pages with 200,
    # so stop once a page yields no headlines
    if not headline_text:
        break
    page = page + 10
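If you want exactly the first n pages rather than an open-ended `while True` loop, the pagination can be sketched as a bounded function. `scrape_pages` and the injectable `fetch` callable are assumptions made for illustration (and so the loop can be exercised without network access); the URL here keeps only the essential `q`, `tbm`, and `start` parameters:

```python
def scrape_pages(term, n_pages, fetch):
    """Collect raw HTML for the first n_pages of Google News results.

    `fetch` is any callable mapping a URL to a (status_code, html) pair,
    e.g. one wrapping requests.get with the headers from the answer above.
    Stops early if a request fails.
    """
    pages = []
    for page in range(n_pages):
        url = ('https://www.google.com/search'
               '?q={0}&tbm=nws&start={1}'.format(term, page * 10))
        status, html = fetch(url)
        if status != 200:
            break
        pages.append(html)
    return pages
```

Each collected page can then be passed to `BeautifulSoup(html, 'html.parser')` and parsed with the same `find_all` calls as in the answer's code.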