Web scraping and saving the results as JSON

Date: 2017-04-26 17:58:54

Tags: python web-scraping beautifulsoup web-crawler

I want to scrape a website with BeautifulSoup in the following way:

  1. The home page has 40 category names.

  2. Then go to each category, e.g. (startupstash.com/ideageneration/), where there are some subcategories.

  3. Now go to each subcategory, say the first one is startupstash.com/resource/milanote/, and get the content details.

  4. The same applies to all 40 categories + the number of subcategories in each + the details of every subcategory.

    Could someone please suggest how to approach this, a method with BeautifulSoup, or possible code? I have tried the following:

    import requests
    from bs4 import BeautifulSoup
    headers={'User-Agent':'Mozilla/5.0'}
    
    
    base_url="http://startupstash.com/"
    req_home_page=requests.get(base_url,headers=headers)
    soup=BeautifulSoup(req_home_page.text, "html5lib")
    links_tag=soup.find_all('li', {'class':'categories-menu-item'})
    titles_tag=soup.find_all('span',{'class':'name'})
    links,titles=[],[]
    
    for link in links_tag:
        links.append(link.a.get('href'))
    #print(links)
    for title in titles_tag:
        titles.append(title.getText())
    print("HOME PAGE TITLES ARE \n",titles)                                                              
    #HOME PAGE RESULT TITLE FINISH HERE
    
    for i in range(0,len(links)):
        req_inside_page = requests.get(links[i],headers=headers)
        page_store =BeautifulSoup(req_inside_page.text, "html5lib")
        jump_to_next=page_store.find_all('div', { 'class' : 'company-listing more' })
        nextlinks=[]
        for div in jump_to_next:
            nextlinks.append(div.a.get("href"))
        print("DETAIL OF THE LINKS IN EVERY CATEGORIES SCRAPPED HERE \n",nextlinks)                     #SCRAPPED THE WEBSITES IN EVERY CATEGORIES
    
        for j in range(0,len(nextlinks)):
            req_final_page=requests.get(nextlinks[j],headers=headers)
            page_stored=BeautifulSoup(req_final_page.text,'html5lib')
            detail_content=page_stored.find('div', { 'class' : 'company-page-body body'})
            details,website=[],[]
            for content in detail_content:
                details.append(content.string)
            print("DESCRIPTION ABOUT THE WEBSITE \n", details)  # scraped the details of the website
    
    
            detail_website=page_stored.find('div',{'id':"company-page-contact-details"})
            table=detail_website.find('table')
            for tr in table.find_all('tr')[2:]:
                tds=tr.find_all('td')[1:]
                for td in tds:
                    website.append(td.a.get('href'))
                    print("VISIT THE WEBSITE \n",website)
    

1 Answer:

Answer 0 (score: 0)

OK, first you need to add a 'User-Agent' to your headers to mimic a web browser (and please don't abuse the site).
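
For example, a minimal setup (the one-liners below assume a soup like this, re-parsed for whichever page you are currently on):

    import requests
    from bs4 import BeautifulSoup

    headers = {'User-Agent': 'Mozilla/5.0'}  # mimic a browser
    resp = requests.get('http://startupstash.com/', headers=headers)
    soup = BeautifulSoup(resp.text, 'html5lib')
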
Then you can extract the links from the first page with the following line:

    links = [ li.a.get('href') for li in soup.find_all('li', {'class':'categories-menu-item'}) ]

Then loop over those links and grab the links from each page:

    links = [ div.a.get('href') for div in soup.find_all('div', { 'class' : 'company-listing-more' }) ]

Finally, get the content:

    content = soup.find('div', { 'class' : 'company-page-body body'}).text
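
Putting the pieces together with the JSON output the title asks for, here is a minimal end-to-end sketch. It assumes the class names above ('categories-menu-item', 'company-listing-more', 'company-page-body body') still match the site's markup; the helper get_soup and the output filename startupstash.json are just illustrative:

    import json
    import requests
    from bs4 import BeautifulSoup

    headers = {'User-Agent': 'Mozilla/5.0'}

    def get_soup(url):
        # fetch a page and parse it; html5lib tolerates broken markup
        resp = requests.get(url, headers=headers)
        return BeautifulSoup(resp.text, 'html5lib')

    results = {}
    home = get_soup('http://startupstash.com/')

    # step 1: category links on the home page
    category_links = [li.a.get('href') for li in
                      home.find_all('li', {'class': 'categories-menu-item'})]

    for cat_url in category_links:
        cat_soup = get_soup(cat_url)
        # step 2: resource links inside each category page
        resource_links = [div.a.get('href') for div in
                          cat_soup.find_all('div', {'class': 'company-listing-more'})]
        entries = []
        for res_url in resource_links:
            # step 3: the description on each resource page
            body = get_soup(res_url).find('div', {'class': 'company-page-body body'})
            entries.append({'url': res_url,
                            'description': body.text.strip() if body else None})
        results[cat_url] = entries

    # finally, save everything as JSON
    with open('startupstash.json', 'w') as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

Forty categories times all their subcategories is a lot of requests, so consider adding a time.sleep() between calls to stay polite.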