LinkedIn web scraping snippet

Time: 2020-05-10 11:24:23

Tags: python-3.x web-scraping linkedin-api

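I am working on a university research project that involves scraping data from the web. I started from an existing GitHub project, but that project does not retrieve all of the data.

The project works as follows:

  1. Search Google using keywords. Example: (accounting "send me an email to Google"

  2. Extract the search-result snippet.

  3. Retrieve the data from that snippet.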
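
Step 3 can be sketched in isolation as follows. This is a minimal sketch, not part of the GitHub project: `extract_email` is a hypothetical helper name, and the sample text is an English rendering of the snippet quoted in this question. It reuses the same loose regex as the script in this question, and it trims a trailing `.` or `-` because that pattern also captures sentence-ending punctuation after the address.

```python
import re

# Same pattern as the script in this question; it is deliberately loose.
EMAIL_RE = re.compile(r'[\w\.-]+@[\w\.-]+')

def extract_email(snippet):
    """Return the first email-like string in a snippet, or None (hypothetical helper)."""
    flat = ' '.join(snippet.split('\n'))   # snippets can contain line breaks
    match = EMAIL_RE.search(flat)
    if match is None:
        return None
    # The pattern also swallows a sentence-ending '.' or '-', so trim those.
    return match.group(0).rstrip('.-')

# Sample text based on the snippet quoted in this question.
sample = ("... marketing department in 2009. For more information about career "
          "opportunities with our company, email me at: "
          "vicki@productivedentist.com. Neighborhood Smiles, LLC ...")
print(extract_email(sample))
```

Without the `rstrip` call, the match for this sample would end with the full stop that follows the address.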

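The problem is:

The extracted snippet looks like this: "... marketing department in 2009. For more information about career opportunities with our company, email me at: vicki@productivedentist.com. Neighborhood Smiles, LLC ..."

The snippet is not shown in full: the "..." hides information such as the role and location. How can I retrieve all of the information with a script?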

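from googleapiclient.discovery import build                     # Google Custom Search Engine API client.
import datetime as dt                                           # Timestamp for the output file name.
import os                                                       # To create the output folder if it is missing.
import sys                                                      # Command-line arguments.
from xlwt import Workbook                                       # For writing the .xls file.
import re                                                       # For email search using regex.

if __name__ == '__main__':
    # Expect two arguments: the search term and the number of requests to make.
    if len(sys.argv) != 3:
        sys.exit('Usage: {} "<search term>" <num_requests>'.format(sys.argv[0]))

    # Create an output file name in the format "srch_res_yyyyMMdd_hhmmss.xls" in the output folder.
    now_sfx = dt.datetime.now().strftime('%Y%m%d_%H%M%S')
    output_dir = './output/'
    os.makedirs(output_dir, exist_ok=True)                      # wb.save() fails if the folder does not exist.
    output_fname = output_dir + 'srch_res_' + now_sfx + '.xls'

    search_term = sys.argv[1]
    num_requests = int(sys.argv[2])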

    my_api_key = "replace_with_you_api_key"                 # See readme.md for how to get your API key.
    my_cse_id = "011658049436509675749:gkuaxghjf5u"         # Google CSE that searches for LinkedIn profiles matching the query.
    service = build("customsearch", "v1", developerKey=my_api_key)
    wb = Workbook()
    sheet1 = wb.add_sheet(search_term[0:15])
    sheet1.write(0,0,'Name')
    sheet1.write(0,1,'Profile Link')
    sheet1.write(0,2,'Snippet')
    sheet1.write(0,3,'Present Organisation')
    sheet1.write(0,4,'Location')
    sheet1.write(0,5,'Role')
    sheet1.write(0,6,'Email')
    sheet1.col(0).width = 256 * 20
    sheet1.col(1).width = 256 * 50
    sheet1.col(2).width = 256 * 100
    sheet1.col(3).width = 256 * 20
    sheet1.col(4).width = 256 * 20
    sheet1.col(5).width = 256 * 50
    sheet1.col(6).width = 256 * 50
    wb.save(output_fname)
    row = 1  # Next row to write; row 0 holds the column headers.

    # Run one Custom Search request and return the parsed JSON response.
    def google_search(search_term, cse_id, start_val, **kwargs):
        return service.cse().list(q=search_term, cx=cse_id, start=start_val, **kwargs).execute()
    for i in range(0, num_requests):
        # 1-based index of the first result on this page: 1, 11, 21, ...
        # (the API serves at most 100 results per query).
        start_val = 1 + (i * 10)
        # Build and execute the search request for this page.
        results = google_search(search_term,
            my_cse_id,
            start_val,
            num=10  # Results per request; the API accepts 1 to 10.
        )
        # A page may hold fewer than 10 results (or none), so iterate over what was returned.
        for item in results.get('items', []):
            snippet = item.get('snippet', '')
            # Snippets may contain line breaks; flatten them to a single line.
            newSnippet = ' '.join(snippet.split('\n'))
            contain = re.search(r'[\w\.-]+@[\w\.-]+', newSnippet)
            if contain is not None:
                title = item['title']
                link = item['link']
                org = "-NA-"
                location = "-NA-"
                role = "-NA-"
                # 'pagemap' is optional in the response, so fall back to an empty dict.
                people = item.get('pagemap', {}).get('person', [])
                if people:
                    person = people[0]
                    org = person.get('org', org)
                    location = person.get('location', location)
                    role = person.get('role', role)
                print(title[:-23])
                sheet1.write(row,0,title[:-23])
                sheet1.write(row,1,link)
                sheet1.write(row,2,newSnippet)
                sheet1.write(row,3,org)
                sheet1.write(row,4,location)
                sheet1.write(row,5,role)
                # The regex also captures a trailing '.' or '-' after the address, so trim it.
                sheet1.write(row,6,contain[0].rstrip('.-'))
                print('Wrote {} search result(s)...'.format(row))
                wb.save(output_fname)
                row = row + 1

    print('Output file "{}" written.'.format(output_fname))
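
The `snippet` field arrives already truncated, so the script above cannot recover what the "..." hides from that field alone. One thing that might be worth checking, sketched below under assumptions: result items sometimes carry page meta tags under `pagemap['metatags']`, and a tag such as `og:description` can be longer than the snippet. Whether these keys are present for the LinkedIn results of this CSE is an assumption, `longest_description` is a hypothetical helper, and the sample item is made up for illustration.

```python
def longest_description(item):
    """Pick the longest description-like text available for one search result.

    `item` is one entry of results['items']; this helper is hypothetical.
    """
    candidates = [item.get('snippet', '')]
    # 'pagemap' and 'metatags' are optional, so default to empty containers.
    for tags in item.get('pagemap', {}).get('metatags', []):
        # Assumed keys; they exist only if the page declares these meta tags.
        for key in ('og:description', 'twitter:description'):
            value = tags.get(key)
            if value:
                candidates.append(value)
    # Collapse line breaks and repeated spaces, then keep the longest text.
    cleaned = [' '.join(text.split()) for text in candidates]
    return max(cleaned, key=len)

# Made-up result item, for illustration only.
example_item = {
    'snippet': '... email me at: jane@example.com. Example\nCo ...',
    'pagemap': {'metatags': [{
        'og:description': 'Accountant at Example Co, Springfield. '
                          'For career information email me at: jane@example.com.',
    }]},
}
print(longest_description(example_item))
```

If no meta tags are present, the helper simply falls back to the flattened snippet, so it could replace the `newSnippet` computation without changing the rest of the loop.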

0 answers:

No answers