Extracting data/links from a Google search with Beautiful Soup

Date: 2016-02-23 22:55:31

Tags: javascript python html beautifulsoup bs4

Evening Folks,

I'm trying to ask Google a question and pull all the relevant links from its respective search query (i.e. I search "site:Wikipedia.com Thomas Jefferson" and it gives me wiki.com/jeff, wiki.com/tom, etc.).

Here is my code:

from bs4 import BeautifulSoup
from urllib2 import urlopen

query = 'Thomas Jefferson'

query.replace (" ", "+")
#replaces whitespace with a plus sign for Google compatibility purposes

soup = BeautifulSoup(urlopen("https://www.google.com/?gws_rd=ssl#q=site:wikipedia.com+" + query), "html.parser")
#creates soup and opens URL for Google. Begins search with site:wikipedia.com so only wikipedia
#links show up. Uses html parser.

for item in soup.find_all('h3', attrs={'class' : 'r'}):
    print item.string
#Guides BS to h3 class "r" where green Wikipedia URLs are located, then prints URLs
#Limiter code to only pull top 5 results

The goal here is for me to set the query variable, have Python query Google, and have Beautiful Soup pull all of the "green" links, if you will.

Here is a picture of a Google results page

I only want to pull the green links, in their entirety. Strangely, Google's source code is "hidden" (a symptom of their search architecture), so Beautiful Soup can't simply pull an href out of an h3 tag. I can see the h3 hrefs when I Inspect Element, but not when I view the page source.

Here is a picture of the Inspect Element

My question: How do I pull the top 5 most relevant green links from Google via BeautifulSoup if I can't access their source code, only Inspect Element?

PS: To give an idea of what I'm trying to achieve, here are two Stack Overflow questions I found that come close to mine:

beautiful soup extract a href from google search

How to collect data of Google Search with beautiful soup using python

3 answers:

Answer 0 (score: 4)

I got a different URL than Rob M. when I tried searching with JavaScript disabled -

https://www.google.com/search?q=site:wikipedia.com+Thomas+Jefferson&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw

To make this work for any query, you should first make sure your query has no spaces in it (which is why you were getting the 400: Bad Request). You can do this with urllib.quote_plus():

query = "Thomas Jefferson"
query = urllib.quote_plus(query)

quote_plus will urlencode all of the spaces as plus signs, creating a valid URL.
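
For reference, urllib.quote_plus is the Python 2 spelling; in Python 3 the same function is urllib.parse.quote_plus. A minimal sketch of the equivalent encoding step, assuming Python 3:

from urllib.parse import quote_plus

query = quote_plus('Thomas Jefferson')  # the space is encoded as '+'
# query is now 'Thomas+Jefferson'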

However, even with that, this doesn't work with urllib - you get a 403: Forbidden. I got it working by using the python-requests module, like this:

import requests
import urllib
from bs4 import BeautifulSoup

query = 'Thomas Jefferson'
query = urllib.quote_plus(query)

r = requests.get('https://www.google.com/search?q=site:wikipedia.com+{}&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw'.format(query))
soup = BeautifulSoup(r.text, "html.parser")
#creates soup and opens URL for Google. Begins search with site:wikipedia.com so only wikipedia
#links show up. Uses html parser.

links = []
for item in soup.find_all('h3', attrs={'class' : 'r'}):
    links.append(item.a['href'][7:]) # [7:] strips the /url?q= prefix
#Guides BS to h3 class "r" where green Wikipedia URLs are located, then prints URLs
#Limiter code to only pull top 5 results

Printing the links gives:

print links
#  [u'http://en.wikipedia.com/wiki/Thomas_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggUMAA&usg=AFQjCNG6INz_xj_-p7mpoirb4UqyfGxdWA',
#   u'http://www.wikipedia.com/wiki/Jefferson%25E2%2580%2593Hemings_controversy&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggeMAE&usg=AFQjCNEjCPY-HCdfHoIa60s2DwBU1ffSPg',
#   u'http://en.wikipedia.com/wiki/Sally_Hemings&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggjMAI&usg=AFQjCNGxy4i7AFsup0yPzw9xQq-wD9mtCw',
#   u'http://en.wikipedia.com/wiki/Monticello&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggoMAM&usg=AFQjCNE4YlDpcIUqJRGghuSC43TkG-917g',
#   u'http://en.wikipedia.com/wiki/Thomas_Jefferson_University&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggtMAQ&usg=AFQjCNEDuLjZwImk1G1OnNEnRhtJMvr44g',
#   u'http://www.wikipedia.com/wiki/Jane_Randolph_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggyMAU&usg=AFQjCNHmXJMI0k4Bf6j3b7QdJffKk97tAw',
#   u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1800&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFgg3MAY&usg=AFQjCNEqsc9jDsDetf0reFep9L9CnlorBA',
#   u'http://en.wikipedia.com/wiki/Isaac_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFgg8MAc&usg=AFQjCNHKAAgylhRjxbxEva5IvDA_UnVrTQ',
#   u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1796&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFghBMAg&usg=AFQjCNHviErFQEKbDlcnDZrqmxGuiBG9XA',
#   u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1804&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFghGMAk&usg=AFQjCNEJZSxCuXE_Dzm_kw3U7hYkH7OtlQ']
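
The question only wants the top 5 results, and the "limiter" comment in the code above isn't actually implemented. A minimal sketch of one way to do it, slicing the find_all() result (same soup and links list as above):

for item in soup.find_all('h3', attrs={'class' : 'r'})[:5]:  # keep only the first 5 result headings
    links.append(item.a['href'][7:])  # [7:] strips the /url?q= prefix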

Answer 1 (score: 1)

If you're getting empty results, you need to specify a user-agent; that could be one of the reasons. I've also simplified the code a bit and removed the query variable.

Code to test, and the full example:

from bs4 import BeautifulSoup
import requests
import lxml

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get(
    'https://www.google.com/search?q=site:wikipedia.com thomas edison',
    headers=headers).text

soup = BeautifulSoup(response, 'lxml')

for link in soup.find_all('div', class_='yuRUbf'):
    links = link.a['href']
    print(links)

Output:

https://en.wikipedia.com/wiki/Edison,_New_Jersey
https://en.wikipedia.com/wiki/Motion_Picture_Patents_Company
https://www.wikipedia.com/wiki/Thomas_E._Murray
https://en.wikipedia.com/wiki/Incandescent_light_bulb
https://en.wikipedia.com/wiki/Phonograph_cylinder
https://en.wikipedia.com/wiki/Emile_Berliner
https://wikipedia.com/wiki/Consolidated_Edison
https://www.wikipedia.com/wiki/hello
https://www.wikipedia.com/wiki/Tom%20Alston
https://en.wikipedia.com/wiki/Edison_screw

Alternatively, you can use the Google Search Engine Results API from SerpApi.

Part of the JSON output:

{
 "position": 1,
 "title": "Thomas Edison - Wikipedia",
 "link": "https://en.wikipedia.org/wiki/Thomas_Edison",
 "displayed_link": "en.wikipedia.org › wiki › Thomas_Edison",
 "snippet": "Thomas Alva Edison (February 11, 1847 – October 18, 1931) was an American inventor and businessman who has been described as America's greatest ..."
}

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "site:wikipedia.com thomas edison",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(f"Link: {result['link']}")

Output:

Link: https://en.wikipedia.com/wiki/Edison,_New_Jersey
Link: https://en.wikipedia.com/wiki/Motion_Picture_Patents_Company
Link: https://www.wikipedia.com/wiki/Thomas_E._Murray
Link: https://en.wikipedia.com/wiki/Incandescent_light_bulb
Link: https://en.wikipedia.com/wiki/Phonograph_cylinder
Link: https://en.wikipedia.com/wiki/Emile_Berliner
Link: https://wikipedia.com/wiki/Consolidated_Edison
Link: https://www.wikipedia.com/wiki/hello
Link: https://www.wikipedia.com/wiki/Tom%20Alston
Link: https://en.wikipedia.com/wiki/Edison_screw

Disclaimer: I work for SerpApi.

Answer 2 (score: 0)

This won't work with a hash search (#q=site:wikipedia.com, as you have it), because that loads the data in via AJAX rather than serving fully parseable HTML for the results. You should use this instead:

soup = BeautifulSoup(urlopen("https://www.google.com/search?gbv=1&q=site:wikipedia.com+" + query), "html.parser")

For reference, I disabled JavaScript and performed a Google search to get this URL structure.
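
Putting that URL together with the code from the question, a minimal end-to-end sketch (Python 2, and assuming Google doesn't block the plain urlopen request - the top answer reports a 403: Forbidden with urllib and falls back to requests):

from bs4 import BeautifulSoup
from urllib2 import urlopen
import urllib

query = urllib.quote_plus('Thomas Jefferson')  # encode the space as '+'

soup = BeautifulSoup(urlopen("https://www.google.com/search?gbv=1&q=site:wikipedia.com+" + query), "html.parser")

for item in soup.find_all('h3', attrs={'class' : 'r'})[:5]:  # top 5 green result links
    print item.a['href'][7:]  # strip the leading /url?q= prefix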