Python: scraping links from Google search results

Time: 2019-01-22 06:43:57

Tags: python beautifulsoup

Is there a way to scrape only those links from Google search results whose URL contains a specific word, using BeautifulSoup or Selenium?

import requests 
from bs4 import BeautifulSoup 
import csv 

URL = "https://www.google.co.in/search?q=site%3Afacebook.com+friends+groups&oq=site%3Afacebook.com+friends+groups"
r = requests.get(URL) 

soup = BeautifulSoup(r.content, 'html5lib') 

I want to extract only the links that point to groups.

1 answer:

Answer 0: (score: 0)

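I'm not sure exactly what you are trying to do, but if you want to extract the Facebook links from the returned content, just check whether facebook.com appears in each URL: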

import requests
from bs4 import BeautifulSoup

URL = "https://www.google.co.in/search?q=site%3Afacebook.com+friends+groups&oq=site%3Afacebook.com+friends+groups"
r = requests.get(URL)
soup = BeautifulSoup(r.text, 'html5lib')
# Print every anchor whose href mentions facebook.com
for link in soup.find_all('a', href=True):
    if 'facebook.com' in link.get('href'):
        print(link.get('href'))
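
The substring check above also matches Google's own navigation links, since the query itself contains facebook.com, and it does not yet filter for a specific word such as "groups". The following is a minimal sketch of a stricter filter, not a definitive implementation. It assumes that result links may arrive wrapped as `/url?q=<target>&...` redirects, which is how Google has served them to non-JavaScript clients but is not guaranteed. The helper names `unwrap_google_href` and `filter_links` are hypothetical and use only the standard library.

```python
from urllib.parse import urlparse, parse_qs


def unwrap_google_href(href):
    """Return the target of a '/url?q=...' redirect, or href unchanged.

    Assumption: Google wraps result links this way; adjust if the markup differs.
    """
    parsed = urlparse(href)
    if parsed.path == "/url":
        target = parse_qs(parsed.query).get("q")
        if target:
            return target[0]
    return href


def filter_links(hrefs, domain, keyword):
    """Keep URLs hosted on `domain` (or a subdomain) whose path contains `keyword`."""
    matches = []
    for href in hrefs:
        url = unwrap_google_href(href)
        parts = urlparse(url)
        host = parts.netloc.lower()
        on_domain = host == domain or host.endswith("." + domain)
        if on_domain and keyword in parts.path:
            matches.append(url)
    return matches
```

With the soup object from the answer, this could be called as `filter_links([a['href'] for a in soup.find_all('a', href=True)], 'facebook.com', '/groups/')`.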

Update: There is another way to approach this. Set a legitimate User-Agent, that is, add request headers that imitate a browser:

# This is a standard user-agent of Chrome browser running on Windows 10
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}

Example:

from bs4 import BeautifulSoup
import requests

URL = 'https://www.google.co.in/search?q=site%3Afacebook.com+friends+groups&oq=site%3Afacebook.com+friends+groups'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
# Send the browser-like User-Agent with the request
resp = requests.get(URL, headers=headers).text
soup = BeautifulSoup(resp, 'html.parser')
for link in soup.find_all('a', href=True):
    if 'facebook.com' in link.get('href'):
        print(link.get('href'))

In addition, you can send a fuller set of headers to look more like a legitimate browser, for example:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language' : 'en-US,en;q=0.5',
    'Accept-Encoding' : 'gzip',
    'DNT' : '1', # Do Not Track Request Header
    'Connection' : 'close'
}
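
The search URL used throughout is hand-encoded (`site%3Afacebook.com+friends+groups`). As a small sketch, the same query string can be produced from a readable query with the standard library, which makes it easier to change the search terms. The names `SEARCH_ENDPOINT` and `build_search_url` are hypothetical, and the `oq` parameter from the original URL is left out on the assumption that it is not required.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL used in the question
SEARCH_ENDPOINT = "https://www.google.co.in/search"


def build_search_url(query, endpoint=SEARCH_ENDPOINT):
    """Return the search URL for `query`, percent-encoding it as a form value."""
    return endpoint + "?" + urlencode({"q": query})
```

The result can then be passed together with the headers dictionary above, as in `requests.get(build_search_url('site:facebook.com friends groups'), headers=headers)`.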