Question

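I am trying to search Google for "coffee shops", turn the shop names, addresses, etc. into a DataFrame, run some analysis, and export the result to Excel.

I tried using Pandas read_html and it returned 'HTTPError: HTTP Error 403: Forbidden'. Any idea how to do this?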

Answer 1

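First of all, scraping is discouraged because it violates their ToS.

However, if you still want to go ahead and use their data, there are scraping tools for Python: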

BeautifulSoup
Scrapy
Requests

I am just assuming you are using Python. If you are using R, you can use:

rvest

Alternatively, you can use their Places Search API and Places Details API.
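
To make the API route more concrete, here is a minimal sketch of how a Places text search could be turned into rows for a DataFrame. The endpoint URL, the `query`/`key` parameters, and the `results`, `name`, `formatted_address`, and `rating` fields reflect my understanding of the legacy Places Text Search API and are assumptions, not something stated in this answer; check them against the current Google documentation. The helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed legacy Places Text Search endpoint; verify against the current docs.
TEXT_SEARCH_URL = "https://maps.googleapis.com/maps/api/place/textsearch/json"


def build_text_search_url(query, api_key):
    """Build the request URL for a text search such as 'coffee shop'."""
    return TEXT_SEARCH_URL + "?" + urlencode({"query": query, "key": api_key})


def places_to_rows(payload):
    """Flatten the assumed JSON response into one dict per place."""
    rows = []
    for place in payload.get("results", []):
        rows.append({
            "name": place.get("name"),
            "address": place.get("formatted_address"),
            "rating": place.get("rating"),
        })
    return rows
```

Once the JSON response has been downloaded and decoded, something like `pandas.DataFrame(places_to_rows(payload)).to_excel("coffee_shops.xlsx", index=False)` would cover the DataFrame and Excel steps from the question.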

Answer 2

You can use selenium webdriver like this:

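import os
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# chromedriver.exe is expected to sit next to this script
driver_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'chromedriver.exe')
url = "https://www.example.com"  # driver.get() needs a full URL, including the scheme
driver = webdriver.Chrome(service=Service(driver_path))  # Selenium 4 style
driver.get(url)
# get the address from the html document
addresses = []
for elem in driver.find_elements(By.XPATH, './/div[@class = "address"]'):
    addresses.append(elem.text)
driver.quit()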

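To do this, you need to download chromedriver. You also need to look at the page's source to find the tags and attributes of the information you are looking for. A comprehensive example can be found in Example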
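
If you would rather pull the values out of the page source yourself, the same tag and attribute lookup can be sketched with the standard library's `html.parser`. This is only an illustration: the `address` class is taken from the XPath used in this answer, the `AddressParser` and `extract_addresses` names are hypothetical, and a real page will need its own selectors.

```python
from html.parser import HTMLParser


class AddressParser(HTMLParser):
    """Collect the text of every <div> whose class list contains 'address'."""

    def __init__(self):
        super().__init__()
        self.addresses = []   # finished address strings
        self._depth = 0       # > 0 while inside an address div
        self._chunks = []     # text pieces of the current address

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1          # nested div inside an address div
        elif "address" in classes:
            self._depth = 1           # entering an address div
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Join the pieces and collapse runs of whitespace.
                self.addresses.append(" ".join("".join(self._chunks).split()))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_addresses(html):
    """Return the address strings found in an HTML document."""
    parser = AddressParser()
    parser.feed(html)
    parser.close()
    return parser.addresses
```

With Selenium, the input could be `driver.page_source`, for example `extract_addresses(driver.page_source)`. Unlike the XPath in this answer, which requires the class attribute to be exactly "address", this sketch also matches elements that carry additional classes.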

Answer 3

You are getting error 403 because you have been blacklisted; Google won't let you scrape!
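
One thing that is often tried first is sending a browser-like User-Agent header, since requests that identify themselves as a Python client are frequently rejected outright. The sketch below uses only the standard library; the header value and the function names are illustrative assumptions, and there is no guarantee that Google will accept the request.

```python
import urllib.request

# Assumed, illustrative header value; any realistic browser string would do.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download the page and return its decoded HTML (performs network I/O)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The returned HTML can then be handed to `pandas.read_html(io.StringIO(html))`. Keep in mind that `read_html` only extracts `<table>` elements, so it will not find much on a page that lays out its results without tables.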

You can find some techniques for dealing with this here:

Manage blacklisted request with Scrapy

How to prevent getting blacklisted while scraping

Answer 4

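You can also use a third-party service like Serp API, which returns Google search engine results. It takes care of the proxy and parsing problems for you.

It is easy to integrate with Python: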

from lib.google_search_results import GoogleSearchResults

params = {
    "q" : "Coffee",
    "location" : "Austin, Texas, United States",
    "hl" : "en",
    "gl" : "us",
    "google_domain" : "google.com",
    "api_key" : "demo",
}

query = GoogleSearchResults(params)
dictionary_results = query.get_dictionary()

GitHub: https://github.com/serpapi/google-search-results-python
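
To get from the returned dictionary to something Excel can open, a rough sketch could look like the following. It assumes the dictionary contains an `organic_results` list whose items have `position`, `title`, and `link` keys; that layout is my assumption about the service's response, so inspect the real output first. The helper names are hypothetical, and CSV is used here only as a dependency-free stand-in for an Excel export.

```python
import csv
import io


def organic_rows(dictionary_results):
    """Flatten the assumed 'organic_results' list into simple row dicts."""
    rows = []
    for item in dictionary_results.get("organic_results", []):
        rows.append({
            "position": item.get("position"),
            "title": item.get("title"),
            "link": item.get("link"),
        })
    return rows


def rows_to_csv(rows, fieldnames=("position", "title", "link")):
    """Serialize the rows as CSV text, which Excel can open directly."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

If pandas is available, `pandas.DataFrame(organic_rows(dictionary_results))` gives the DataFrame asked for in the question, and `to_excel` can replace the CSV step.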

Python: Google search results scraping
