Google search using bs4, Python

Asked: 2018-06-25 06:41:23

Tags: python web-scraping beautifulsoup

I want to get the address for "Spotlight 29 casino address" from a Google search inside a Python script. Why doesn't my code work?

from bs4 import BeautifulSoup
# from googlesearch import search
import urllib.request
import datetime
article='spotlight 29 casino address'
url1 ='https://www.google.co.in/#q='+article
content1 = urllib.request.urlopen(url1)
soup1 = BeautifulSoup(content1,'lxml')
#print(soup1.prettify())
div1 = soup1.find('div', {'class':'Z0LcW'}) #get the div where it's located
# print (datetime.datetime.now(), 'street address:  ' , div1.text)
print (div1)

Pastebin Link

3 Answers:

Answer 0 (score: 0)

If you want to get Google search results, Selenium with Python is the easier way.

Below is a simple example.

from selenium import webdriver
import urllib.parse
from bs4 import BeautifulSoup

chromedriver = '/xxx/chromedriver'  # replace /xxx with the path where chromedriver is installed
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(chromedriver, chrome_options=chrome_options)

article='spotlight 29 casino address'
driver.get("https://www.google.co.in/#q="+urllib.parse.quote(article))
# driver.page_source  <-- HTML source; you can parse it later
soup = BeautifulSoup(driver.page_source, 'lxml')
div = soup.find('div',{'class':'Z0LcW'})
print(div.text)
driver.quit()
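
As a side note, the answer box is rendered by JavaScript, so page_source can in principle be read before the div exists. Below is a minimal sketch of an explicit wait (my addition, not part of the answer above; it assumes the same Z0LcW class and the driver created above, and would go before driver.quit()):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# wait up to 10 seconds for Google's answer box to be rendered before reading it
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "Z0LcW")))
print(driver.find_element(By.CLASS_NAME, "Z0LcW").text)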

Answer 1 (score: 0)

Google renders this page with JavaScript, which is why you won't receive that div through urllib.request.urlopen.

As a solution, you can use Selenium, a Python library for driving a real browser. Install it with the console command "pip install selenium", and then code like this will work:

from bs4 import BeautifulSoup
from selenium import webdriver


article = 'spotlight 29 casino address'
url = 'https://www.google.co.in/#q=' + article
driver = webdriver.Firefox()
driver.get(url)
html = BeautifulSoup(driver.page_source, "lxml")

div = html.find('div', {'class': 'Z0LcW'})
print(div.text)
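
One caveat with the snippet above: the Firefox window is never closed. A minimal variation (same logic, only wrapped so the browser always quits) might look like this:

driver = webdriver.Firefox()
try:
    driver.get(url)
    html = BeautifulSoup(driver.page_source, "lxml")
    print(html.find('div', {'class': 'Z0LcW'}).text)
finally:
    driver.quit()  # close the browser even if parsing fails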

Answer 2 (score: 0)

You get an empty div because your request is blocked by Google: the default user-agent is python-requests if you are using the requests library (info), or something similar for other HTTP clients. By sending a user-agent header you can fake a real browser visit.

If the address is present in the HTML (as it is in this case), you can get it without Selenium just by adding a user-agent.

Pass a real user-agent in the request headers:

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
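
As a quick sanity check (my addition, not part of the original answer), you can confirm which user-agent is actually being sent by requesting a header-echo service such as httpbin.org:

import requests

# httpbin echoes back the request headers it received
echoed = requests.get('https://httpbin.org/headers', headers=headers).json()
print(echoed['headers']['User-Agent'])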

Here is the code and the full example:

from bs4 import BeautifulSoup
import requests
import lxml

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get(
    'https://www.google.com/search?q=spotlight 29 casino address',
    headers=headers)

html = response.text
soup = BeautifulSoup(html, 'lxml')

print(soup.select_one(".sXLaOe, .iBp4i").text)
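
A small stylistic variation (my own suggestion, not from the original answer): let requests build and URL-encode the query string via params instead of embedding spaces in the URL:

params = {'q': 'spotlight 29 casino address'}
response = requests.get('https://www.google.com/search',
                        params=params, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
print(soup.select_one(".sXLaOe, .iBp4i").text)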