Getting the exact website link from Google with BeautifulSoup

Asked: 2017-07-08 17:34:52

Tags: python, beautifulsoup

I want to search Google with BeautifulSoup and open the first result link. However, when I open that link it shows an error. I think this is because Google does not return the exact URL of the site; it appends several parameters to it. How can I get the exact URL?

When I try to use the cite tag instead, it works fine, but it causes problems for long URLs.

The first link I get using soup.h3.a['href'][7:] is: 'http://www.wikipedia.com/wiki/White_holes&sa=U&ved=0ahUKEwi_oYLLm_rUAhWJNI8KHa5SClsQFggbMAI&usg=AFQjCNGN-vlBvbJ9OPrnq40d0_b8M0KFJQ'

Here is my code:

import requests
from bs4 import BeautifulSoup

# Fetch the basic-HTML (gbv=1) Google results page
r = requests.get('https://www.google.com/search?q=site:wikipedia.com+Black+hole&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw')
soup = BeautifulSoup(r.text, "html.parser")
# First result link, with the leading "/url?q=" prefix sliced off
print(soup.h3.a['href'][7:])
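
For context, in Google's basic HTML results the anchor href is usually a redirect of the form /url?q=<target>&sa=...&usg=..., so the [7:] slice only removes the "/url?q=" prefix and the tracking parameters stay attached to the target URL, which is why the opened link fails. A minimal illustration (the href value below is made up):

href = "/url?q=http://www.wikipedia.com/wiki/White_holes&sa=U&usg=AFQjCN"  # hypothetical example href
print(href[7:])  # http://www.wikipedia.com/wiki/White_holes&sa=U&usg=AFQjCN -- parameters still attached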

2 answers:

Answer 0 (score: 1)

You can split the string that is returned:

url = soup.h3.a['href'][7:].split('&')  # split off the &sa=...&usg=... parameters
print(url[0])  # the clean target URL
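
If the raw href (before the [7:] slice) is still at hand, the same result can be obtained without the magic offset by treating the redirect as a query string. A small sketch using the standard library, not part of the original answer (the href value is hypothetical):

from urllib.parse import urlparse, parse_qs

href = "/url?q=http://www.wikipedia.com/wiki/White_holes&sa=U&usg=AFQjCN"  # hypothetical raw href
target = parse_qs(urlparse(href).query)["q"][0]  # value of the q= parameter, i.e. the real link
print(target)  # http://www.wikipedia.com/wiki/White_holes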

Answer 1 (score: 0)

Hopefully, taking all the answers mentioned above into account, your code will look something like this:

from bs4 import BeautifulSoup
import requests
import csv

url = "https://www.google.co.in/search?q=site:wikipedia.com+Black+hole&dcr=0&gbv=2&sei=Nr3rWfLXMIuGvQT9xZOgCA"
r = requests.get(url)
data = r.text

url1 = "https://www.google.co.in"  # base host (defined here but not used below)

soup = BeautifulSoup(data, "html.parser")
# Each organic result is wrapped in a div with class "g"
get_details = soup.find_all("div", attrs={"class": "g"})
final_data = []
for details in get_details:
    link = details.find_all("h3")
    for mdetails in link:
        links = mdetails.find_all("a")
        for lnk in links:
            # Strip the "/url?q=" prefix, then drop the trailing &sa=...&usg=... parameters
            lmk = lnk.get("href")[7:].split("&")
            final_data.append([lmk[0]])

filename = "Google.csv"
with open("./" + filename, "w", newline="") as csvfile:
    writer = csv.writer(csvfile, delimiter=",")
    # Write one cleaned link per row
    for row in final_data:
        writer.writerow(row)
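
To actually open the first extracted link, as the question asked, here is a short follow-up sketch (it assumes final_data is non-empty and reuses the requests import from above):

if final_data:
    first_url = final_data[0][0]
    page = requests.get(first_url)      # fetch the first cleaned result link
    print(page.status_code, first_url)  # 200 indicates the exact URL resolved correctly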