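I want to scrape emails from a Google search results query. But when I use the CSS selector method `select` to access the class and print the result, it always shows an empty list. How can I access the `.r` class or `class="g"`?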
import requests
from bs4 import BeautifulSoup
url = "https://www.google.com/search?sxsrf=ACYBGNQA4leQETe0psVZPu7daLWbdsc9Ow%3A1579194494737&ei=fpggXpvRLMakwQKkqpSICg&q=%22computer+science+%22%22usa%22+%22%40yahoo.com%22&oq=%22computer+science+%22%22usa%22+%22%40yahoo.com%22&gs_l=psy-ab.12...0.0..7407...0.0..0.0.0.......0......gws-wiz.82okhpdJLYg&ved=0ahUKEwibiI_3zYjnAhVGUlAKHSQVBaEQ4dUDCAs"
responce = requests.get(url)
soup = BeautifulSoup(responce.text, "html.parser")
test = soup.select('.r')
print(test)
Answer 0 (score: 0)
Your program is correct, but to get the right response from Google you need to specify a User-Agent header:
import requests
from bs4 import BeautifulSoup
url = "https://www.google.com/search?sxsrf=ACYBGNQA4leQETe0psVZPu7daLWbdsc9Ow%3A1579194494737&ei=fpggXpvRLMakwQKkqpSICg&q=%22computer+science+%22%22usa%22+%22%40yahoo.com%22&oq=%22computer+science+%22%22usa%22+%22%40yahoo.com%22&gs_l=psy-ab.12...0.0..7407...0.0..0.0.0.......0......gws-wiz.82okhpdJLYg&ved=0ahUKEwibiI_3zYjnAhVGUlAKHSQVBaEQ4dUDCAs"
headers = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0'}
responce = requests.get(url, headers=headers) # <-- specify custom header
soup = BeautifulSoup(responce.text, "html.parser")
test = soup.select('.r')
print(test)
Prints:
[<div class="r"><a href="https://www.yahoo.com/news/11-course-complete-computer-science-171322233.html" onmousedown="return rwt(this,'','','','1','AOvVaw2wM4TUxc_4V7s9GjeWTNAG','','2ahUKEwjt17Kk-YjnAhW2R0EAHcnsC3QQFjAAegQIAxAB','','',event)"><div class="TbwUpd"><img alt="https://...
...
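The URL in the question carries several session-specific parameters (`sxsrf`, `ei`, `ved`, and so on). As a rough sketch, assuming Google only needs the `q` parameter (an assumption, not something verified here), the same search URL can be built from the plain query text with the standard library. `build_search_url` is a hypothetical helper name, not part of `requests` or `bs4`.

```python
from urllib.parse import urlencode


def build_search_url(query: str) -> str:
    """Return a Google search URL that carries only the `q` parameter.

    Assumption: the other parameters in the original URL are optional.
    """
    return "https://www.google.com/search?" + urlencode({"q": query})


# The query text is the decoded `q` value from the question's URL.
url = build_search_url('"computer science ""usa" "@yahoo.com"')

# Same User-Agent as in the answer above; pass it as headers=headers.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) "
                  "Gecko/20100101 Firefox/72.0"
}

print(url)
```

The resulting `url` and `headers` can then be passed to `requests.get(url, headers=headers)` exactly as in the answer.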
Answer 1 (score: 0)
To extract emails from Google search results, you need to use a regex:
# this regex needs possible modifications
re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', string_where_to_search_from)
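To see what this pattern does and does not pick up, here is a small offline sketch that runs it on a hard-coded string. The sample text and the `extract_emails` helper are made up for illustration; only the regex itself comes from the answer.

```python
import re

# Same pattern as above, compiled once for reuse.
EMAIL_RE = re.compile(r'[\w\.-]+@[\w\.-]+\.\w+')


def extract_emails(text):
    """Return unique matches in first-seen order; None is treated as empty."""
    return list(dict.fromkeys(EMAIL_RE.findall(text or "")))


# Hypothetical snippet text: one address is repeated, one is followed by a period.
sample = ("Contact ahmed_733@yahoo.com or yzou2002@yahoo.com. "
          "Again: ahmed_733@yahoo.com")

print(extract_emails(sample))
```

The trailing period after the second address is not captured, because the pattern requires at least one word character after the last dot. Strings without a dotted domain (for example `user@localhost`) are skipped entirely, which is one reason the pattern may need adjusting.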
Code:
from bs4 import BeautifulSoup
import requests, lxml, re

headers = {
    "User-agent":
    # Adjacent string literals are concatenated, so the trailing space matters.
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q="computer science ""usa" "@yahoo.com"', headers=headers)
soup = BeautifulSoup(html.text, 'lxml')

# .tF2Cxc (result container) and .lyLwlc (snippet) are Google's class names
# at the time of writing and may change.
for result in soup.select('.tF2Cxc'):
    try:
        snippet = result.select_one('.lyLwlc').text
    except AttributeError:
        # select_one() returned None: this result has no snippet.
        snippet = None
    match_email = re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', str(snippet))
    email = '\n'.join(match_email).strip()
    print(email)
----------
'''
ahmed_733@yahoo.com
yjzou@uguam.uog
yzou2002@yahoo.com
...
'''
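Alternatively, you can do the same thing with the Google Search Engine Results API from SerpApi. It is a paid API with a free trial of 5,000 searches.
It does not extract emails with a regex for you, although that could be a great feature. The main difference is that getting the job done is easier and faster than building everything from scratch.
Code to integrate: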
from serpapi import GoogleSearch
import re

params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": '"computer science ""usa" "@yahoo.com"',
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    try:
        snippet = result['snippet']
    except KeyError:
        # Some organic results come without a snippet.
        snippet = None
    match_email = re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', str(snippet))
    email = '\n'.join(match_email).strip()
    print(email)
---------
'''
shaikotweb@yahoo.com
ahmed_733@yahoo.com
RPeterson@L1id.com
rj_peterson@yahoo.com
'''
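The loop above prints one (possibly empty) string per result. As a variation, the sketch below collects unique addresses into a list instead. It runs offline on a made-up dictionary that only mimics the `organic_results` / `snippet` keys used above; `emails_from_results` and `fake_results` are hypothetical names, and the snippet texts are invented.

```python
import re

EMAIL_RE = re.compile(r'[\w\.-]+@[\w\.-]+\.\w+')


def emails_from_results(results):
    """Collect unique emails from each result's snippet, in first-seen order."""
    found = []
    for result in results.get("organic_results", []):
        # Results without a snippet contribute nothing.
        snippet = result.get("snippet") or ""
        for email in EMAIL_RE.findall(snippet):
            if email not in found:
                found.append(email)
    return found


# Made-up stand-in for search.get_dict(); no request is sent.
fake_results = {
    "organic_results": [
        {"title": "Example A", "snippet": "Email: shaikotweb@yahoo.com"},
        {"title": "Example B"},  # no snippet key
        {"title": "Example C",
         "snippet": "Write to rj_peterson@yahoo.com or shaikotweb@yahoo.com."},
    ]
}

print(emails_from_results(fake_results))
```

Using `dict.get` avoids the try/except around the missing `snippet` key, and the list keeps duplicates out when the same address shows up in several results.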
Disclaimer: I work for SerpApi.