Generating URLs for scraping multiple pages of Yahoo and Bing with Python and BeautifulSoup

Time: 2019-11-26 09:11:13

Tags: python beautifulsoup

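I want to scrape news from other sources. I found a way to generate URLs for scraping multiple pages from Google, but I think there is a way to generate a shorter link.

Could you tell me how to generate URLs for scraping multiple pages of Bing and Yahoo news, and whether there is a way to make the Google URL shorter?

Here is the code for Google: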

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

term = 'usa'

for page in range(1, 5):

    # Google pages results in steps of 10 through the `start` parameter;
    # this loop yields start=10, 20, 30, 40 (start=0 would be the first page)
    start = page * 10

    url = 'https://www.google.com/search?q={}&tbm=nws&sxsrf=ACYBGNTx2Ew_5d5HsCvjwDoo5SC4U6JBVg:1574261023484&ei=H1HVXf-fHfiU1fAP65K6uAU&start={}&sa=N&ved=0ahUKEwi_q9qog_nlAhV4ShUIHWuJDlcQ8tMDCF8&biw=1280&bih=561&dpr=1.5'.format(term, start)
    print(url)

    # Fetch the results page and parse the HTML
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

These are the URLs for Yahoo and Bing, but they only return one page:

yahoo: url = 'https://news.search.yahoo.com/search?q={}'.format(term)
bing: url = 'https://www.bing.com/news/search?q={}'.format(term)

1 Answer:

Answer 0: (score: 1)

I'm not sure whether this shortened news URL is what you are looking for.

#Google (shortened URL):

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

term = 'usa'

for page in range(1, 5):

    # Result offset: 10, 20, 30, 40
    start = page * 10

    # Only q (search term), tbm=nws (news tab) and start (offset) are kept
    url = 'https://www.google.com/search?q={}&tbm=nws&start={}'.format(term, start)
    print(url)

    # TLS certificate verification is left at its default (enabled)
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
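
The same offset idea may carry over to Yahoo and Bing. Below is a minimal sketch. It assumes that Yahoo News search takes a 1-based result offset in a `b` parameter and that Bing News takes one in `first`. Neither parameter is confirmed anywhere in this post, so check them against the "next page" link on the live sites. `ENGINES` and `build_urls` are hypothetical names introduced only for illustration.

```python
from urllib.parse import urlencode

# Base URLs come from the question; the offset parameter names
# ('b' for Yahoo, 'first' for Bing) are assumptions, not confirmed above.
ENGINES = {
    'yahoo': ('https://news.search.yahoo.com/search', 'b'),
    'bing': ('https://www.bing.com/news/search', 'first'),
}


def build_urls(engine, term, pages=4, per_page=10):
    """Return one search URL per results page (hypothetical helper)."""
    base, offset_param = ENGINES[engine]
    urls = []
    for page in range(pages):
        # Assumed 1-based offsets: 1, 11, 21, ...
        offset = page * per_page + 1
        query = urlencode({'q': term, offset_param: offset})
        urls.append('{}?{}'.format(base, query))
    return urls


if __name__ == '__main__':
    for engine in ENGINES:
        for url in build_urls(engine, 'usa'):
            print(url)
```

Each generated URL can then be passed to `requests.get(url, headers=headers)` and parsed with `BeautifulSoup`, exactly as in the Google loop. Using `urlencode` also escapes search terms that contain spaces or other special characters.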