Question

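I am trying to create a chatbot that can get Bing search results using Python. I have tried many websites, but they all use old Python 2 code or Google. I am currently in China and cannot access YouTube, Google, or anything else related to Google (I also cannot use Azure or Microsoft Docs). I would like the results to look like this: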

This is the title
https://this-is-the-link.com

This is the second title
https://this-is-the-second-link.com

Code

import requests
import bs4
import re
import urllib.request
from bs4 import BeautifulSoup
page = urllib.request.urlopen("https://www.bing.com/search?q=programming")
soup = BeautifulSoup(page.read())
links = soup.findAll("a")
for link in links:
    print(link["href"])

It gives me

/?FORM=Z9FD1
javascript:void(0);
javascript:void(0);
/rewards/dashboard
/rewards/dashboard
javascript:void(0);
/?scope=web&FORM=HDRSC1
/images/search?q=programming&FORM=HDRSC2
/videos/search?q=programming&FORM=HDRSC3
/maps?q=programming&FORM=HDRSC4
/news/search?q=programming&FORM=HDRSC6
/shop?q=programming&FORM=SHOPTB
http://go.microsoft.com/fwlink/?LinkId=521839
http://go.microsoft.com/fwlink/?LinkID=246338
https://go.microsoft.com/fwlink/?linkid=868922
http://go.microsoft.com/fwlink/?LinkID=286759
https://go.microsoft.com/fwlink/?LinkID=617297

Any help would be greatly appreciated (I am using Python 3.6.9 on Ubuntu).

Answer 1

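Your code actually works; the problem lies in the HTTP request headers. By default, urllib sends Python-urllib/{version} as the User-Agent header value, which makes it easy for a website to recognize the request as automated. To avoid this, set a custom value by passing a Request object as the first argument to urlopen():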

from urllib.parse import urlencode, urlunparse
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

query = "programming"
url = urlunparse(("https", "www.bing.com", "/search", "", urlencode({"q": query}), ""))
custom_user_agent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"
req = Request(url, headers={"User-Agent": custom_user_agent})
page = urlopen(req)
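# The rest is the question's code, with an explicit parser
# and a guard for <a> tags that have no href attribute
soup = BeautifulSoup(page.read(), "html.parser")
links = soup.find_all("a")
for link in links:
    href = link.get("href")
    if href:
        print(href)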
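The loop above still prints every link on the page, not the title/link pairs the question asks for. Below is a minimal sketch of one way to narrow it down. It assumes that each organic result sits in an `<li class="b_algo">` element whose `<h2>` contains the result link; that is an assumption about Bing's markup, not something stated in this thread, and it may change at any time. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages. The names `BingResultParser`, `extract_results`, `format_results`, and `sample_html` are hypothetical.

```python
from html.parser import HTMLParser


class BingResultParser(HTMLParser):
    """Collect (title, link) pairs from <li class="b_algo"><h2><a href=...>.

    The "b_algo" class name is an assumption about Bing's markup.
    """

    def __init__(self):
        super().__init__()
        self.results = []         # finished (title, link) pairs
        self._in_result = False   # inside <li class="b_algo">
        self._in_h2 = False       # inside the result's <h2>
        self._href = None         # href of the <a> currently being read
        self._text = []           # text fragments of that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and "b_algo" in (attrs.get("class") or "").split():
            self._in_result = True
        elif tag == "h2" and self._in_result:
            self._in_h2 = True
        elif tag == "a" and self._in_h2 and attrs.get("href"):
            self._href = attrs["href"]
            self._text = []

    def handle_data(self, data):
        # Only keep text that belongs to a result's heading link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.results.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == "h2":
            self._in_h2 = False
        elif tag == "li":
            self._in_result = False


def extract_results(html):
    """Return a list of (title, link) tuples found in the HTML string."""
    parser = BingResultParser()
    parser.feed(html)
    return parser.results


def format_results(results):
    """Format results as 'title', newline, 'link', separated by blank lines."""
    return "\n\n".join(f"{title}\n{link}" for title, link in results)


# Hypothetical sample page; a real response is far larger.
sample_html = """
<ul>
  <li class="b_algo"><h2><a href="https://this-is-the-link.com">This is the title</a></h2>
    <p>snippet <a href="/ignored">more</a></p></li>
  <li class="b_algo"><h2><a href="https://this-is-the-second-link.com">This is the <strong>second</strong> title</a></h2></li>
  <li class="b_ad"><h2><a href="https://ad.example">Ad</a></h2></li>
</ul>
<a href="/rewards/dashboard">Rewards</a>
"""

print(format_results(extract_results(sample_html)))
```

With the request code above, the HTML string would come from something like `page.read().decode("utf-8", errors="replace")`. If the assumed class name does not match what Bing returns, the result list is simply empty.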

P.S. Take a look at the comment @edd left on the question.

Answer 2

You can scrape Bing search engine results with SerpApi. It is a paid API with a free trial.

Full example

import os

# https://pypi.org/project/google-search-results/
from serpapi import BingSearch

search = BingSearch({
    "q": "programming",
    "count": 50,
    "api_key": os.getenv("API_KEY")
})
data = search.get_json()

print("Organic results\n")

for organic_result in data['organic_results']:
    print(
        f"Title: {organic_result['title']}\nLink: {organic_result['link']}\n")

Output

Organic results

Title: Computer programming - Wikipedia
Link: https://en.wikipedia.org/wiki/Computer_programming

Title: Programming | Definition of Programming by Merriam-Webster
Link: https://www.merriam-webster.com/dictionary/programming

Title: Computer programming | Computing | Khan Academy
Link: https://www.khanacademy.org/computing/computer-programming

Title: What is Programming? (video) | Khan Academy
Link: https://www.khanacademy.org/computing/computer-programming/programming/intro-to-programming/v/programming-intro

Stripped...
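To get the plain title-then-link layout the question asks for, the response dictionary can be reshaped with a small helper. This is only a sketch: it relies on the `organic_results`, `title`, and `link` keys used in the example above, the helper name `format_organic_results` and the sample dictionary are hypothetical, and it needs no network access or API key.

```python
def format_organic_results(data):
    """Return each result as a title line and a link line, separated by blank lines."""
    blocks = []
    for result in data.get("organic_results", []):
        title = result.get("title")
        link = result.get("link")
        if title and link:  # skip incomplete entries instead of raising KeyError
            blocks.append(f"{title}\n{link}")
    return "\n\n".join(blocks)


# Hypothetical sample shaped like the response used in the example above.
sample_data = {
    "organic_results": [
        {"title": "Computer programming - Wikipedia",
         "link": "https://en.wikipedia.org/wiki/Computer_programming"},
        {"title": "Entry without a link"},
        {"title": "Computer programming | Computing | Khan Academy",
         "link": "https://www.khanacademy.org/computing/computer-programming"},
    ]
}

print(format_organic_results(sample_data))
```

In the full example, `format_organic_results(data)` would take the dictionary returned by `search.get_json()`.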

Disclaimer: I work at SerpApi.

Getting Bing search results in Python

2 answers: