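Basically, what I mean is that when I search for this, the href attribute of the first result is a google.com/url redirect. I wouldn't mind that if I were just browsing the internet in my browser, but I want to get the search results in Python. So for this code: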
import requests
from bs4 import BeautifulSoup

def get_web_search(query):
    query = query.replace(' ', '+') # Replace with %20 also works
    response = requests.get('https://www.google.com/search', params={"q": query})
    r_data = response.content
    soup = BeautifulSoup(r_data, 'html.parser')
    result_raw = []
    results = []
    for result in soup.find_all('h3', class_='r', limit=1):
        result_raw.append(result)
    for result in result_raw:
        results.append({
            'url' : result.find('a').get('href'),
            'text' : result.find('a').get_text()
        })
    print(results)

get_web_search("turtles")
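I would expect
[{ url: "https://www.google.com/search?q=turtles", text: "Turtle - Wikipedia" }]
but what I get is
[{'url': '/url?q=https://en.wikipedia.org/wiki/Turtle', 'text': 'Turtle - Wikipedia'}]
Am I missing something here? Do I need to provide different headers or other request parameters? Any help is appreciated. Thanks.
Note: I have looked at other posts about this, but I am a beginner, so I could not understand them because they were not in Python.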
Answer 0 (score: 1)
Just follow the link's redirect and it will take you to the right page. Assume your link is in the url variable.
import urllib2  # Python 2; in Python 3, use urllib.request.urlopen instead

url = "/url?q=https://en.wikipedia.org/wiki/Turtle&sa=U&ved=0ahUKEwja-oaO7u3XAhVMqo8KHYWWCp4QFggVMAA&usg=AOvVaw31hklS09NmMyvgktL1lrTN"
url = "https://www.google.com" + url  # urlopen needs the scheme, not just the host
response = urllib2.urlopen(url) # 'https://www.google.com/url?q=https://en.wikipedia.org/wiki/Turtle&sa=U&ved=0ahUKEwja-oaO7u3XAhVMqo8KHYWWCp4QFggVMAA&usg=AOvVaw31hklS09NmMyvgktL1lrTN'
response.geturl() # 'https://en.wikipedia.org/wiki/Turtle'
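This works because what you get back is Google's redirect to the URL, which is what you actually click on every time you search. The code simply follows the redirect until it reaches the real URL.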
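If you would rather not make a second request, the destination also appears to be carried in the q parameter of the /url link, so it can be read straight out of the href. This is only a minimal sketch under that assumption; extract_target is a hypothetical helper name, not part of any library, and it relies solely on the standard urllib.parse module (Python 3).

```python
from urllib.parse import urlparse, parse_qs

def extract_target(href):
    """Return the destination of a Google '/url?q=...' link.

    Any href that is not such a redirect link is returned unchanged.
    """
    parsed = urlparse(href)
    if parsed.path == "/url":
        # parse_qs maps each query parameter to a list of decoded values
        targets = parse_qs(parsed.query).get("q")
        if targets:
            return targets[0]
    return href

print(extract_target("/url?q=https://en.wikipedia.org/wiki/Turtle&sa=U"))
# https://en.wikipedia.org/wiki/Turtle
```

Unlike following the redirect, this never contacts the target site, but it depends on Google keeping the destination in q.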
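Answer 1 (score: 0)
Use this package, which provides Google search.
Answer 2 (score: 0)
You can do the same thing using selenium together with Python and BeautifulSoup. It gives you the first result whether the page is JavaScript-enabled or a plain page: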
from selenium import webdriver
from bs4 import BeautifulSoup

def get_data(search_input):
    search_input = search_input.replace(" ","+")
    driver.get("https://www.google.com/search?q=" + search_input)
    soup = BeautifulSoup(driver.page_source,'lxml')
    for result in soup.select('h3.r'):
        item = result.select("a")[0].text
        link = result.select("a")[0]['href']
        print("item_text: {}\nitem_link: {}".format(item,link))
        break

if __name__ == '__main__':
    driver = webdriver.Chrome()
    try:
        get_data("turtles")
    finally:
        driver.quit()
Output:
item_text: Turtle - Wikipedia
item_link: https://en.wikipedia.org/wiki/Turtle
Answer 3 (score: 0)
You can use CSS selectors to get those links.
soup.select_one('.yuRUbf a')['href']
from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=turtles', headers=headers)
soup = BeautifulSoup(html.text, 'html.parser')

# iterate over the organic result containers
for result in soup.select('.tF2Cxc'):
    # extract the URL from the "result" container
    url = result.select_one('.yuRUbf a')['href']
    print(url)
------------
'''
https://en.wikipedia.org/wiki/Turtle
https://www.worldwildlife.org/species/sea-turtle
https://www.britannica.com/animal/turtle-reptile
https://www.britannica.com/story/whats-the-difference-between-a-turtle-and-a-tortoise
https://www.fisheries.noaa.gov/sea-turtles
https://www.fisheries.noaa.gov/species/green-turtle
https://turtlesurvival.org/
https://www.outdooralabama.com/reptiles/turtles
https://www.rewild.org/lost-species/lost-turtles
'''
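Alternatively, you can do the same thing with the Google Search Engine Results API from SerpApi.
It is a paid API with a free trial of 5,000 searches. The main difference is that you only have to work through structured JSON, rather than figuring out why something does not work.
Code to integrate: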
from serpapi import GoogleSearch

params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "turtle",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    print(result['link'])
--------------
'''
https://en.wikipedia.org/wiki/Turtle
https://www.britannica.com/animal/turtle-reptile
https://www.britannica.com/story/whats-the-difference-between-a-turtle-and-a-tortoise
https://turtlesurvival.org/
https://www.worldwildlife.org/species/sea-turtle
https://www.conserveturtles.org/
'''
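Since the result is an ordinary Python dict, the loop above can also be written defensively, in case a response lacks organic_results or an entry lacks link. This is only a sketch: extract_links is a hypothetical helper, and the sample dict is made up for illustration (including its title key), not real API output.

```python
def extract_links(results):
    """Collect the 'link' of each organic result, skipping entries without one."""
    links = []
    for result in results.get("organic_results", []):
        link = result.get("link")
        if link:
            links.append(link)
    return links

# Hypothetical sample shaped like the dict used above (not real API output)
sample = {
    "organic_results": [
        {"title": "Turtle - Wikipedia", "link": "https://en.wikipedia.org/wiki/Turtle"},
        {"title": "entry without a link"},
    ]
}
print(extract_links(sample))
```

With a real response you would pass the dict returned by search.get_dict() in place of sample.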
Disclaimer: I work for SerpApi.