Question

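I have been learning a lot of Python recently to use on some projects at work.

Currently, I need to do some web scraping with Google search results. I found several sites that demonstrate how to use the AJAX Google API to search, but after trying it, it appears to no longer be supported. Any suggestions?

I have been looking for a way to do this, but I can't seem to find any solution that currently works.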

Answer 1

Here is another service that can be used for scraping SERPs (https://zenserp.com). It does not require a client library, and it is cheap.

Here is a Python code sample:

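import requests

# Put your Zenserp API key in the 'apikey' header.
headers = {
    'apikey': '',
}

# Query parameters: search terms, location, search engine and language.
params = (
    ('q', 'Pied Piper'),
    ('location', 'United States'),
    ('search_engine', 'google.com'),
    ('language', 'English'),
)

# Send the search request; the result is available on the response object.
response = requests.get('https://app.zenserp.com/api/search', headers=headers, params=params)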

Answer 2

You can always scrape Google results directly. To do this, you can use the URL https://google.com/search?q=<Query>, which returns the top 10 search results.
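As a minimal sketch of building that URL, the query can be percent-encoded with the standard library so that spaces and special characters survive. The helper name `build_search_url` is hypothetical and not part of the original answer.

```python
from urllib.parse import urlencode


def build_search_url(query: str) -> str:
    """Return a Google search URL with the query percent-encoded."""
    # urlencode escapes spaces and reserved characters such as '&'.
    return "https://google.com/search?" + urlencode({"q": query})


print(build_search_url("Pied Piper"))
```

Encoding the query this way avoids broken requests when the search terms contain characters like `&` or `#`.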

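Then you can use lxml, for example, to parse the page. Depending on what you use, you can query the resulting node tree either with a CSS selector (.r a) or with an XPath selector (//h3[@class="r"]/a).

In some cases the resulting URL redirects through Google. It then usually contains a query parameter q that holds the actual target URL.

Example code using lxml and requests: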

from urllib.parse import urlparse, parse_qs

from lxml.html import fromstring
from requests import get

raw = get("https://www.google.com/search?q=StackOverflow").text
page = fromstring(raw)

# Note: page.cssselect needs the separate cssselect package to be installed.
for result in page.cssselect(".r a"):
    url = result.get("href")
    if url is None:
        continue
    # Redirect links look like /url?q=<target>; take the target from q.
    if url.startswith("/url?"):
        url = parse_qs(urlparse(url).query)['q'][0]
    print(url)

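A note on Google banning your IP: in my experience, Google only bans you if you start spamming it with search requests. If Google thinks you are a bot, it responds with a 503.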
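One possible way to react to that 503 is to wait and retry with growing delays. The sketch below is an assumption, not something the answer prescribes: `fetch_with_backoff`, its parameters, and the `fetch` callable (which should return a status code and body) are all hypothetical names, and backing off is not guaranteed to lift a block.

```python
import time
from typing import Callable, Tuple


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch() until it stops returning HTTP 503, waiting longer each time."""
    for attempt in range(max_attempts):
        status, body = fetch()
        if status != 503:
            # Any status other than 503 is handed back to the caller as-is.
            return body
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("still receiving 503 after %d attempts" % max_attempts)


# Demo with a fake fetch function: one 503, then a successful page.
responses = iter([(503, ""), (200, "<html>ok</html>")])
print(fetch_with_backoff(lambda: next(responses), sleep=lambda seconds: None))
```

With requests, `fetch` could be a small function that calls `get(url)` and returns `(response.status_code, response.text)`.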

Answer 3

You have 2 options: build it yourself or use a SERP API.

A SERP API returns the Google search results as a formatted JSON response.
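To give a rough idea of what working with such a JSON response looks like, here is a small sketch that pulls the titles out of a payload. The sample payload is made up for illustration; the `organic_results` and `title` keys mirror the scraperbox example later in this answer, and other providers may use different names. The helper `result_titles` is hypothetical.

```python
import json

# Hypothetical sample payload; real field names depend on the provider.
sample = '{"organic_results": [{"title": "Best coffee in town"}, {"title": "Coffee guide"}]}'


def result_titles(raw_json: str) -> list:
    """Return the title of every organic result in a SERP API response."""
    data = json.loads(raw_json)
    # .get with a default avoids a KeyError when no results are returned.
    return [item.get("title", "") for item in data.get("organic_results", [])]


print(result_titles(sample))
```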

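I would recommend a SERP API because it is easier to use, and you don't have to worry about getting blocked by Google.

1. SERP API

I have had good experience with the scraperbox serp api.

You can use the following code to call the API. Make sure to replace YOUR_API_TOKEN with your scraperbox API token.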

import urllib.parse
import urllib.request
import ssl
import json
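# Note: this disables TLS certificate verification for all urllib requests.
# Remove it unless your environment really cannot verify certificates.
ssl._create_default_https_context = ssl._create_unverified_context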

# Urlencode the query string
q = urllib.parse.quote_plus("Where can I get the best coffee")

# Create the query URL.
query = "https://api.scraperbox.com/google"
query += "?token=%s" % "YOUR_API_TOKEN"
query += "&q=%s" % q
query += "&proxy_location=gb"

# Call the API.
request = urllib.request.Request(query)

raw_response = urllib.request.urlopen(request).read()
raw_json = raw_response.decode("utf-8")
response = json.loads(raw_json)

# Print the first result title
print(response["organic_results"][0]["title"])

2. Build your own Python scraper

I recently wrote an in-depth blog post on how to scrape search results with Python.

Here is a quick summary.

First, you should get the HTML contents of the Google search result page.

import urllib.request

url = 'https://google.com/search?q=Where+can+I+get+the+best+coffee'

# Perform the request
request = urllib.request.Request(url)

# Set a normal User Agent header, otherwise Google will block the request.
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36')
raw_response = urllib.request.urlopen(request).read()

# Read the response as a utf-8 string
html = raw_response.decode("utf-8")

Then you can use BeautifulSoup to extract the search results. For example, the following code gets all the titles.

from bs4 import BeautifulSoup

# The code to get the html contents here.

soup = BeautifulSoup(html, 'html.parser')

# Find all the search result divs
divs = soup.select("#search div.g")
for div in divs:
    # Search for a h3 tag
    results = div.select("h3")

    # Check if we have found a result
    if (len(results) >= 1):

        # Print the title
        h3 = results[0]
        print(h3.get_text())

You can extend this code to also extract the search result URLs and descriptions.
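As one possible extension, the sketch below collects each result's URL together with its title. It runs against a simplified, made-up HTML snippet rather than a real results page, and it assumes the first link inside each result div is the result link; Google's actual markup may differ and changes over time. The function `extract_results` is a hypothetical helper. Descriptions are left out because the answer does not say which element holds them.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for a real results page.
html = """
<div id="search">
  <div class="g"><a href="https://example.com/a"><h3>First result</h3></a></div>
  <div class="g"><a href="https://example.com/b"><h3>Second result</h3></a></div>
  <div class="g"><span>No title here</span></div>
</div>
"""


def extract_results(html):
    """Return a list of {'title', 'url'} dicts, one per result div."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for div in soup.select("#search div.g"):
        h3 = div.select_one("h3")
        link = div.select_one("a[href]")
        # Skip divs that lack a title or a link.
        if h3 is None or link is None:
            continue
        results.append({"title": h3.get_text(), "url": link["href"]})
    return results


for item in extract_results(html):
    print(item["title"], item["url"])
```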

Answer 4

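You can also use a third-party service like Serp API, which provides Google search engine results. It solves the problem of being blocked, and you don't have to rent proxies or parse the results yourself.

It is easy to integrate with Python: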

from lib.google_search_results import GoogleSearchResults

params = {
    "q" : "Coffee",
    "location" : "Austin, Texas, United States",
    "hl" : "en",
    "gl" : "us",
    "google_domain" : "google.com",
    "api_key" : "demo",
}

query = GoogleSearchResults(params)
dictionary_results = query.get_dictionary()

GitHub: https://github.com/serpapi/google-search-results-python

Google Search Web Scraping with Python
