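How to correctly scrape the LinkedIn directory

Time: 2016-01-15 02:30:46

Tags: python-2.7

I'm trying to build a basic LinkedIn scraper for a research project, and I've run into trouble scraping through the levels of the directory. I'm a beginner; every time I run the code below, IDLE returns an error and then closes. Please see the code and the error below:

Code: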

import requests
from bs4 import BeautifulSoup
from urllib2 import urlopen
from pprint import pprint as pp

PROFILE_URL = "linkedin.com"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

#use this to gather all of the individual links from the second directory page

def get_second_links(pre_section_link):
    response = requests.get(pre_section_link, headers=headers)
    soup = BeautifulSoup(response.content, "lxml")
    column = soup.find("ul", attrs={'class':'column dual-column'})
    second_links = [li.a["href"] for li in column.findAll("li")]
    return second_links


# use this to gather all of the individual links from the third directory page
def get_third_links(section_link):
    response = requests.get(section_link, headers=headers)
    soup = BeautifulSoup(response.content, "lxml")
    column = soup.find("ul", attrs={'class':'column dual-column'})
    third_links = [li.a["href"] for li in column.findAll("li")]
    return third_links

# use this to build the profile links

def get_profile_link(link):
    response = requests.get(link, headers=headers)
    soup = BeautifulSoup(response.content, "lxml")
    column2 = soup.find("ul", attrs={'class':'column dual-column'})
    profile_links = [PROFILE_URL + li.a["href"] for li in column2.findAll("li")]
    return profile_links


if __name__=="__main__":
    sub_directory = get_second_links("https://www.linkedin.com/directory/people-a-1/")    
    sub_directory = map(get_third_links, sub_directory)
    profiles = get_third_links(sub_directory)    
    profiles = map(get_profile_link, profiles)
    profiles = [item for sublist in fourth_links for item in sublist]
    pp(profiles)

The error I keep getting: Error Page

1 answer:

Answer 0 (score: 0)

You need to add https to PROFILE_URL so that the profile links you build are complete URLs:

PROFILE_URL = "https://linkedin.com"
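Beyond the missing scheme, the `__main__` block in the question also passes a whole list to `get_third_links` and refers to `fourth_links`, which is never defined. As a rough sketch only (not a tested LinkedIn scraper), the traversal could be written as below. The helpers `to_absolute` and `crawl` are hypothetical names introduced here, and `get_links` stands for any function that takes a page URL and returns the hrefs found on it, such as `get_second_links` from the question. It also assumes that the hrefs may be either relative or absolute, which is why it uses `urljoin` instead of string concatenation.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2.7, as in the question

PROFILE_URL = "https://linkedin.com"  # scheme included, as the answer suggests


def to_absolute(href, base=PROFILE_URL):
    """Resolve href against base; an already absolute href is returned unchanged."""
    return urljoin(base, href)


def crawl(start_url, get_links, depth):
    """Follow links `depth` levels down and return one flat list of URLs.

    get_links is a hypothetical callable: page URL in, list of hrefs out.
    """
    urls = [start_url]
    for _ in range(depth):
        next_urls = []
        for url in urls:
            # extend() keeps the result flat, so no nested lists to unpack later
            next_urls.extend(to_absolute(href) for href in get_links(url))
        urls = next_urls
    return urls
```

With the question's functions, a call such as `crawl("https://www.linkedin.com/directory/people-a-1/", get_second_links, 3)` would walk three directory levels, assuming every level uses the same `column dual-column` list markup. Whether the site actually serves those pages to a script is a separate matter that this sketch does not address.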