Beautiful Soup: extracting the href from a Google search result

Asked: 2012-04-28 22:33:18

Tags: python html beautifulsoup google-search

A Google search gives me the following HTML for the first result:

<h3 class="r"><a href="https://rads.stackoverflow.com/amzn/click/com/0470284889" rel="nofollow noreferrer" class="l vst" onmousedown="return rwt(this,'','','','1','AFQjCNEv1W9YC2jcSKYdEo2kNqBMJ-Utmg','k89K9hF4cVNpxQYHtEKiUQ','0CCoQFjAA',null,event)"><em>Quantitative Trading</em>: <em>How to Build Your Own Algorithmic</em> <b>...</b> - Amazon</a></h3>

I want to extract the link http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889 from it, but when I use Beautiful Soup to pull it out with

soup.find("h3").find("a").get("href")

I get the following string instead:

/url?q=http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889&sa=U&ei=P2ycT6OoNuasiAL2ncV5&ved=0CBIQFjAA&usg=AFQjCNEo_ujANAKnjheWDRlBKnJ1BGeA7A

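I know the link is in there, and I could parse it out by stripping the /url?q= prefix and everything from the first & onward, but I would like to know whether there is a cleaner solution.

Thanks!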

2 Answers:

Answer 0 (score: 1)

To extract only the first result from the page, you can use find(), or the bs4 select_one() method with a CSS selector.

Code and example in the online IDE:

import requests  # lxml must also be installed: BeautifulSoup loads it by name below
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

# passing parameters in URLs
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {'q': 'Quantitative Trading How to Build Your Own Algorithmic - amazon'}

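def bs4_get_first_googlesearch():
    # requests URL-encodes params into the query string of the search URL
    html = requests.get('https://www.google.com/search', headers=headers, params=params).text
    soup = BeautifulSoup(html, 'lxml')

    # .yuRUbf is assumed to be the container class of an organic result;
    # select_one() returns only the first match
    first_link = soup.select_one('.yuRUbf').a['href']
    print(first_link)

bs4_get_first_googlesearch()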

# output:
'''
https://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889
'''
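
To compare find() and select_one() without hitting Google, here is a minimal offline sketch. The HTML string is a simplified stand-in modelled on the markup quoted in the question (the second result and the example.com URL are invented for illustration), and the built-in html.parser is used so that lxml is not required. Class names such as r or .yuRUbf are generated by Google and may change, so treat any selector as an assumption to re-check against the live page.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the result markup quoted in the question;
# the second entry is invented to show that only the first match is returned.
html = '''
<div id="results">
  <h3 class="r"><a href="/url?q=http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889&amp;sa=U">Quantitative Trading</a></h3>
  <h3 class="r"><a href="/url?q=http://example.com/second&amp;sa=U">Second result</a></h3>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# find() walks the tree by tag name and attributes, one level at a time.
via_find = soup.find("h3", class_="r").find("a").get("href")

# select_one() expresses the same path as a single CSS selector.
via_select = soup.select_one("h3.r a")["href"]

print(via_find == via_select)
print(via_select)
```

Both calls return the href of the first matching link only. With markup like the one in the question, that href is still the /url?q=... redirect, so the query string has to be unpacked afterwards.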

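Alternatively, you can use the Google Search Engine Results API from SerpApi to do the same thing. It is a paid API with a free trial of 5,000 searches. Check out the playground.

The biggest difference is that everything is already done for the end user: selecting elements, bypassing blocks, rotating proxies, and so on.

Code to integrate: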

from serpapi import GoogleSearch
import os

def serpapi_get_first_googlesearch():
    params = {
      "api_key": os.getenv("API_KEY"),
      "engine": "google",
      "q": "Quantitative Trading How to Build Your Own Algorithmic - amazon",
      "hl": "en",
    }

    search = GoogleSearch(params)
    results = search.get_dict()
    # [0] - first element from the search results
    first_link = results['organic_results'][0]['link']
    print(first_link)

serpapi_get_first_googlesearch()

# output:
'''
https://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889
'''
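
The lookup results['organic_results'][0]['link'] raises an exception if the key is missing or the list is empty. A defensive variant might look like the sketch below. The helper name first_organic_link is hypothetical, the sample dict is hand-written rather than real API output, and whether an empty response actually omits the key is an assumption; only the organic_results / link shape is taken from the code above.

```python
def first_organic_link(results):
    """Return the first organic result's link, or None if there is none."""
    # Treat a missing key and an empty list the same way.
    organic = results.get("organic_results") or []
    return organic[0].get("link") if organic else None


# Hand-written stand-in for a response dict; not real API output.
sample = {
    "organic_results": [
        {"link": "https://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889"}
    ]
}

print(first_organic_link(sample))
print(first_organic_link({}))
```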

Disclaimer: I work for SerpApi.

Answer 1 (score: 0)

You can use a combination of urlparse.urlparse and urlparse.parse_qs (in Python 3 both live in urllib.parse), for example:

>>> import urlparse
>>> url = '/url?q=http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889&sa=U&ei=P2ycT6OoNuasiAL2ncV5&ved=0CBIQFjAA&usg=AFQjCNEo_ujANAKnjheWDRlBKnJ1BGe'
>>> data = urlparse.parse_qs(
...     urlparse.urlparse(url).query
... )
>>> data
{'ei': ['P2ycT6OoNuasiAL2ncV5'],
 'q': ['http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889'],
 'sa': ['U'],
 'usg': ['AFQjCNEo_ujANAKnjheWDRlBKnJ1BGe'],
 'ved': ['0CBIQFjAA']}
>>> data['q'][0]
'http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889'
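
The session above uses the Python 2 urlparse module. On Python 3 the same functions are available from urllib.parse, so the approach could be wrapped as in the sketch below. The helper name extract_target_url and the fallback of returning the href unchanged when it is not a /url redirect are assumptions added for illustration, not part of the original answer.

```python
from urllib.parse import parse_qs, urlparse


def extract_target_url(href):
    """Return the 'q' target of a Google '/url?q=...' redirect, else href unchanged."""
    parsed = urlparse(href)
    if parsed.path != "/url":
        return href  # not a redirect link, so treat it as already direct
    # parse_qs maps each query key to a list of (percent-decoded) values.
    targets = parse_qs(parsed.query).get("q")
    return targets[0] if targets else href


href = (
    "/url?q=http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business"
    "/dp/0470284889&sa=U&ei=P2ycT6OoNuasiAL2ncV5&ved=0CBIQFjAA"
)
print(extract_target_url(href))
```

Compared with slicing the string by hand, parse_qs also percent-decodes the value and does not depend on q being the first parameter.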