Getting Google search result URLs from a search string or URL

Date: 2019-11-23 14:38:30

Tags: python web-scraping beautifulsoup

So, I want to find all the search results and store them in a list. Inspecting the Google results page shows that, technically, every result belongs to class g:

Google Search analysis

Technically, extracting the URLs from the search results page should be easy:

import urllib.parse
from bs4 import BeautifulSoup
import requests

text = 'cyber security'
text = urllib.parse.quote_plus(text)

url = 'https://google.com/search?q=' + text

response = requests.get(url)

soup = BeautifulSoup(response.content, 'lxml')
for result_divs in soup.find_all(class_='g'):
    links = [div.find('a') for div in result_divs]
    hrefs = [link.get('href') for link in links]
    print(hrefs)

However, I get no output. Why?

Edit: even saving the page and parsing it manually doesn't help:

import webbrowser

with open('output.html', 'wb') as f:
    f.write(response.content)
webbrowser.open('output.html')

url = "output.html"
page = open(url)
soup = BeautifulSoup(page.read(), features="lxml")

#soup = BeautifulSoup(response.content, 'lxml')
for result_divs in soup.find_all(class_='g'):
    links = [div.find('a') for div in result_divs]
    hrefs = [link.get('href') for link in links]
    print(hrefs)

4 answers:

Answer 0 (score: 1)

You can always climb a few elements up or down to test. All results live in `<div>` elements with the `tF2Cxc` class.

Grabbing the URLs comes down to:

  1. Using a `for` loop together with the bs4 `.select()` method, which takes a CSS selector as input (`.tF2Cxc`).
  2. Calling `.select_one()` with the `.yuRUbf` CSS selector.
  3. Accessing the `<a>` tag and its `href` attribute.

Which becomes (example in an online IDE):

for result in soup.select('.tF2Cxc'):
  link = result.select_one('.yuRUbf').a['href']

Alternatively, you can achieve the same thing with the Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.

Code to integrate:

import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {'q': 'cyber security'}
html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

# container with all needed data
for result in soup.select('.tF2Cxc'):
  link = result.select_one('.yuRUbf').a['href']
  print(link)

# output:
'''
https://www.kaspersky.com/resource-center/definitions/what-is-cyber-security
https://www.cisco.com/c/en/us/products/security/what-is-cybersecurity.html
https://digitalguardian.com/blog/what-cyber-security
https://searchsecurity.techtarget.com/definition/cybersecurity
https://www.cisa.gov/cybersecurity
https://en.wikipedia.org/wiki/Computer_security
https://www.csoonline.com/article/3482001/what-is-cyber-security-types-careers-salary-and-certification.html
https://staysafeonline.org/
'''

Disclaimer: I work for SerpApi.

Answer 1 (score: 0)

from selenium import webdriver
from bs4 import BeautifulSoup
import time

browser = webdriver.Firefox()
dork = 'cyber security'
browser.get(f"https://www.google.com/search?q={dork}")
time.sleep(5)
source = browser.page_source
soup = BeautifulSoup(source, 'html.parser')

for item in soup.findAll('div', attrs={'class': 'r'}):
    for href in item.findAll('a'):
        print(href.get('href'))

Answer 2 (score: 0)

The following approach should fetch a few of the result links from the target page. You may need to filter out links ending with a dot. Getting links out of a Google search with `requests` really is a tough job.

import requests
from bs4 import BeautifulSoup

url = "http://www.google.com/search?q={}&hl=en"

def scrape_google_links(query):
    res = requests.get(url.format(query.replace(" ","+")),headers={"User-Agent":"Mozilla/5.0"})
    soup = BeautifulSoup(res.text,"lxml")
    for result in soup.select(".kCrYT > a > .BNeawe:nth-of-type(2)"):
        print(result.text.replace(" › ","/"))

if __name__ == '__main__':
    scrape_google_links('cyber security')
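The "filter out links ending with a dot" step mentioned above can be sketched in plain Python. The sample strings below are made up for illustration; the trailing `...` mimics how Google truncates long breadcrumb paths:

```python
# Hypothetical scraped link texts; the second one is truncated with "...".
scraped = [
    "https://en.wikipedia.org/wiki/Computer_security",
    "https://www.cisco.com/c/en/us/products/security/what-is-cy...",
    "https://staysafeonline.org/",
]

# Keep only entries that do not end with a dot, i.e. were not truncated.
clean = [link for link in scraped if not link.rstrip().endswith(".")]

print(clean)
```

This keeps the first and third entries and drops the truncated one.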

Answer 3 (score: -1)

Actually, if you print `response.content` and inspect the output, you will see there are no HTML tags with class `g` at all. Those elements appear to be loaded dynamically, and BeautifulSoup only parses the static content it is given. That's why your search for tags with class `g` matches nothing.
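The point above can be reproduced offline: when the parsed HTML contains no element with class `g`, `find_all` returns an empty list, so the question's loop body simply never runs and nothing is printed. The snippets below are contrived stand-ins, not real Google markup:

```python
from bs4 import BeautifulSoup

# Stand-in for what Google returns to a script: no 'g' class anywhere.
stripped_html = '<div class="x"><a href="https://example.com">hit</a></div>'

# The same markup as a browser might see it, with the 'g' class present.
browser_html = '<div class="g"><a href="https://example.com">hit</a></div>'

for label, html in [("stripped", stripped_html), ("browser", browser_html)]:
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.find_all(class_="g")
    print(label, len(matches))  # the stripped page yields zero matches
```

Since `find_all` yields zero matches on the stripped page, no exception is raised either, which is why the original script fails silently rather than with an error.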