Scraping data from a Google search results page with Python

Date: 2020-01-16 20:20:58

Tags: python web-scraping beautifulsoup request python-requests

I want to scrape emails from the search results of a query, but when I use the CSS selector with `select` to access the class and print it, it always returns an empty list. How can I access the `.r` class or `class="g"`?

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.google.com/search?sxsrf=ACYBGNQA4leQETe0psVZPu7daLWbdsc9Ow%3A1579194494737&ei=fpggXpvRLMakwQKkqpSICg&q=%22computer+science+%22%22usa%22+%22%40yahoo.com%22&oq=%22computer+science+%22%22usa%22+%22%40yahoo.com%22&gs_l=psy-ab.12...0.0..7407...0.0..0.0.0.......0......gws-wiz.82okhpdJLYg&ved=0ahUKEwibiI_3zYjnAhVGUlAKHSQVBaEQ4dUDCAs"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    test = soup.select('.r')
    print(test)

2 Answers:

Answer 0 (score: 0)

Your program is correct, but to get a proper response from Google you need to specify a User-Agent header:

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.google.com/search?sxsrf=ACYBGNQA4leQETe0psVZPu7daLWbdsc9Ow%3A1579194494737&ei=fpggXpvRLMakwQKkqpSICg&q=%22computer+science+%22%22usa%22+%22%40yahoo.com%22&oq=%22computer+science+%22%22usa%22+%22%40yahoo.com%22&gs_l=psy-ab.12...0.0..7407...0.0..0.0.0.......0......gws-wiz.82okhpdJLYg&ved=0ahUKEwibiI_3zYjnAhVGUlAKHSQVBaEQ4dUDCAs"

    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0'}

    response = requests.get(url, headers=headers)  # <-- specify custom header
    soup = BeautifulSoup(response.text, "html.parser")
    test = soup.select('.r')
    print(test)

Prints:

    [<div class="r"><a href="https://www.yahoo.com/news/11-course-complete-computer-science-171322233.html" onmousedown="return rwt(this,'','','','1','AOvVaw2wM4TUxc_4V7s9GjeWTNAG','','2ahUKEwjt17Kk-YjnAhW2R0EAHcnsC3QQFjAAegQIAxAB','','',event)"><div class="TbwUpd"><img alt="https://...
    ...
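
The code above only prints the raw `.r` result containers. A minimal sketch of pulling emails out of them, continuing from that snippet and assuming Google still serves the `.r` class to this User-Agent:

    import re

    # Sketch: scan the text of each ".r" result block for email-like strings.
    # Google's class names change frequently, so ".r" may need updating.
    email_pattern = re.compile(r'[\w\.-]+@[\w\.-]+\.\w+')
    for block in soup.select('.r'):
        text = block.get_text(" ", strip=True)
        for email in email_pattern.findall(text):
            print(email)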

Answer 1 (score: 0)

To get emails from the Google search results, you need to use a regex:

    # this regex needs possible modifications
    re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', string_where_to_search_from)
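
For illustration, here is roughly what that pattern matches on a made-up snippet string (the sample text below is hypothetical):

    import re

    # Hypothetical sample text, just to show what the pattern picks up
    sample = "Contact john.doe@yahoo.com or jane_smith@cs.example.edu for details."
    print(re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', sample))
    # ['john.doe@yahoo.com', 'jane_smith@cs.example.edu']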

代码:

    from bs4 import BeautifulSoup
    import requests, lxml, re

    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    }

    html = requests.get('https://www.google.com/search?q="computer science ""usa" "@yahoo.com"', headers=headers)
    soup = BeautifulSoup(html.text, 'lxml')

    for result in soup.select('.tF2Cxc'):  # each organic result container
        try:
            snippet = result.select_one('.lyLwlc').text
        except AttributeError:  # select_one() found no snippet in this result
            snippet = None

        match_email = re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', str(snippet))
        email = '\n'.join(match_email).strip()
        print(email)

    ----------
    '''
    ahmed_733@yahoo.com
    yjzou@uguam.uog
    yzou2002@yahoo.com
    ...
    '''

Alternatively, you can do the same thing with the Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.

It doesn't extract emails with a regex itself, although that could be a great feature. The main difference is that it's easier and faster to get the job done rather than building everything from scratch.

Code to integrate:

    from serpapi import GoogleSearch
    import re

    params = {
        "api_key": "YOUR_API_KEY",
        "engine": "google",
        "q": '"computer science ""usa" "@yahoo.com"',
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    for result in results['organic_results']:
        try:
            snippet = result['snippet']
        except KeyError:  # some results come back without a snippet
            snippet = None

        match_email = re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', str(snippet))
        email = '\n'.join(match_email).strip()
        print(email)

    ---------
    '''
    shaikotweb@yahoo.com
    ahmed_733@yahoo.com
    RPeterson@L1id.com
    rj_peterson@yahoo.com
    '''
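
If you haven't used SerpApi's Python client before, the `serpapi` import above is provided by the `google-search-results` package on PyPI (`pip install google-search-results`), and the `q` parameter carries the same quoted query as the requests-based version.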

Disclaimer, I work for SerpApi.