So I want to find all the search results and store them in a list. Analyzing the Google page shows that, technically, all results belong to the g class:
Extracting the URLs from the search results page should therefore be easy:
import urllib.parse
from bs4 import BeautifulSoup
import requests

text = 'cyber security'
text = urllib.parse.quote_plus(text)
url = 'https://google.com/search?q=' + text

response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')

for result_divs in soup.find_all(class_='g'):
    links = [div.find('a') for div in result_divs]
    hrefs = [link.get('href') for link in links]
    print(hrefs)
However, I get no output. Why?
Edit: Even manually parsing a stored copy of the page doesn't help:
import webbrowser

with open('output.html', 'wb') as f:
    f.write(response.content)
webbrowser.open('output.html')

url = "output.html"
page = open(url)
soup = BeautifulSoup(page.read(), features="lxml")
#soup = BeautifulSoup(response.content, 'lxml')

for result_divs in soup.find_all(class_='g'):
    links = [div.find('a') for div in result_divs]
    hrefs = [link.get('href') for link in links]
    print(hrefs)
Answer 0 (score: 1)
You can always climb a couple of elements up or down for testing. All results are located in a <div> element with the .tF2Cxc class.

Grabbing the URL is as easy as iterating over .tF2Cxc with a for loop combined with the .select() bs4 method, which takes a CSS selector as input, then calling the .select_one() method with the .yuRUbf CSS selector, the <a> tag, and the href attribute. It becomes something like this (example in the online IDE):
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf').a['href']
Full code:
import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {'q': 'cyber security'}

html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

# container with all needed data
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf').a['href']
    print(link)
# output:
'''
https://www.kaspersky.com/resource-center/definitions/what-is-cyber-security
https://www.cisco.com/c/en/us/products/security/what-is-cybersecurity.html
https://digitalguardian.com/blog/what-cyber-security
https://searchsecurity.techtarget.com/definition/cybersecurity
https://www.cisa.gov/cybersecurity
https://en.wikipedia.org/wiki/Computer_security
https://www.csoonline.com/article/3482001/what-is-cyber-security-types-careers-salary-and-certification.html
https://staysafeonline.org/
'''
Alternatively, you can achieve the same thing with the Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.

Disclaimer: I work for SerpApi.
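As a rough sketch of what that integration might look like, assuming SerpApi's google-search-results Python package (pip install google-search-results) and a valid API key:

from serpapi import GoogleSearch

params = {
    "q": "cyber security",      # search query
    "api_key": "YOUR_API_KEY",  # placeholder - replace with your own key
}

search = GoogleSearch(params)  # where the request is sent
results = search.get_dict()    # JSON response parsed into a dict

# each organic result carries its URL in the "link" field
for result in results.get("organic_results", []):
    print(result["link"])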
Answer 1 (score: 0)
from selenium import webdriver
from bs4 import BeautifulSoup
import time

browser = webdriver.Firefox()
dork = 'cyber security'
browser.get(f"https://www.google.com/search?q={dork}")
time.sleep(5)  # give the results page time to render
source = browser.page_source
soup = BeautifulSoup(source, 'html.parser')

# 'r' was the class Google used for result containers at the time
for item in soup.findAll('div', attrs={'class': 'r'}):
    for href in item.findAll('a'):
        print(href.get('href'))
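As a sketch of a sturdier variant: instead of the fixed time.sleep(5), Selenium's explicit waits can block until the results have rendered. The div#search selector below is an assumption about Google's current markup and may need adjusting:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox()
browser.get("https://www.google.com/search?q=cyber+security")

# wait up to 10 seconds for the results container to appear
# (div#search is an assumption about Google's markup)
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div#search"))
)
source = browser.page_source
browser.quit()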
Answer 2 (score: 0)
The following approach should fetch you some of the result links from the target page. You may have to kick out some links ending with a dot. Getting links out of a Google search with requests really is a hard nut to crack.
import requests
from bs4 import BeautifulSoup

url = "http://www.google.com/search?q={}&hl=en"

def scrape_google_links(query):
    res = requests.get(url.format(query.replace(" ", "+")), headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, "lxml")
    # the selector targets the displayed URL text of each result
    for result in soup.select(".kCrYT > a > .BNeawe:nth-of-type(2)"):
        print(result.text.replace(" › ", "/"))

if __name__ == '__main__':
    scrape_google_links('cyber security')
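The "kick out" step mentioned above could look like this sketch. scrape_and_filter is a hypothetical variant of scrape_google_links (same imports and url template as the snippet above) that collects the display URLs and drops truncated ones:

def scrape_and_filter(query):
    # hypothetical helper: Google shortens long display URLs, leaving a
    # trailing dot, so filter those out instead of printing everything
    res = requests.get(url.format(query.replace(" ", "+")), headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, "lxml")
    links = [r.text.replace(" › ", "/") for r in soup.select(".kCrYT > a > .BNeawe:nth-of-type(2)")]
    return [link for link in links if not link.endswith(".")]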
Answer 3 (score: -1)
Actually, if you print response.content and inspect the output, you'll find there are no HTML tags with class g at all. Those elements appear to be loaded dynamically, and BeautifulSoup only parses the static content. That's why your search for tags with class g turns up no elements.
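A quick way to check that claim, as a sketch (the query and URL mirror the question's code): print whether the raw response contains the class at all before handing it to BeautifulSoup:

import urllib.parse
import requests

# sketch: inspect what Google serves to requests' default user agent
query = urllib.parse.quote_plus('cyber security')
response = requests.get('https://google.com/search?q=' + query)

# if this prints False, the served HTML never contained class="g",
# so BeautifulSoup had nothing to find
print(b'class="g"' in response.content)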