Web scraping multiple pages with Beautiful Soup

Time: 2016-02-06 21:23:48

Tags: python-3.x web-scraping beautifulsoup

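I am writing some code to scrape data from Crowdcube.

The idea is to collect the title, description, target capital, raised capital, and category of each pitch.

First, I tried it on a single page. The code works. Here it is: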

from bs4 import BeautifulSoup
import urllib.request

data = {
    'title': [],
    'description': [],
    'target': [],
    'raised': [],
    'category': []
}

l = urllib.request.urlopen('https://www.crowdcube.com/investment/primo-18884')
tree = BeautifulSoup(l, 'lxml')

# title
title = tree.find_all('div', {'class': 'cc-pitch__title'})
data['title'].append(title[0].find('h2').get_text())

# description
description = tree.find_all('div', {'class': 'fullwidth'})
data['description'].append(description[1].find('p').get_text())

# target
target = tree.find_all('div', {'class': 'cc-pitch__stats clearfix'})
data['target'].append(target[0].find('dd').get_text())

# raised
raised = tree.find_all('div', {'class': 'cc-pitch__raised'})
data['raised'].append(raised[0].find('b').get_text())

# category
category = tree.find_all('li', {'class': 'sectors'})
data['category'].append(category[0].find('span').get_text())

data

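I need to download the same information for every project on the site.

This page contains all the links: https://www.crowdcube.com/investments?sort_by=0&q=&hof=1&i1=0&i2=0&i3=0&i4=0&sort_by=7

To do this, I started by building a list of URLs with the following code: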

source= urllib.request.urlopen('https://www.crowdcube.com/investments?sort_by=0&q=&hof=1&i1=0&i2=0&i3=0&i4=0&sort_by=7')

get_link= BeautifulSoup(source, 'lxml')

links_page = [a.attrs.get('href') for a in get_link.select('a[href]')]

links_page = list(set(links_page)) #drops duplicates
links = [l for l in links_page if 'https://www.crowdcube.com/investment/' in l] # drop corrupted links

Here is a sample of the links I get from that code:

 ['https://www.crowdcube.com/investment/floodkit-16516', 
'https://www.crowdcube.com/investment/east-end-manufacturing-14667', 
'https://www.crowdcube.com/investment/wrap-it-up-18021']
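
The filtering step above can also be written with the standard library's URL helpers, which additionally handles relative links and empty href values. This is only a minimal sketch: the function name `project_links`, the `BASE_URL` constant, and the sample list are assumptions made for illustration, and whether the listing page actually contains relative links is not something the question confirms.

```python
from urllib.parse import urljoin, urlparse

# Assumption: the site root, taken from the URLs shown in the question.
BASE_URL = "https://www.crowdcube.com/"


def project_links(hrefs, base=BASE_URL, prefix="/investment/"):
    """Return sorted, de-duplicated absolute URLs on the base host
    whose path starts with `prefix` (hypothetical helper)."""
    base_host = urlparse(base).netloc
    found = set()
    for href in hrefs:
        if not href:
            continue  # skip None or empty href values
        absolute = urljoin(base, href)  # resolve relative links against the base
        parts = urlparse(absolute)
        if parts.netloc == base_host and parts.path.startswith(prefix):
            found.add(absolute)
    return sorted(found)


# Made-up input: two real-looking project links, a duplicate,
# the listing page itself, and a missing href.
sample = [
    "https://www.crowdcube.com/investment/floodkit-16516",
    "/investment/wrap-it-up-18021",
    "https://www.crowdcube.com/investment/floodkit-16516",
    "https://www.crowdcube.com/investments?sort_by=7",
    None,
]
print(project_links(sample))
```

Matching on the parsed path rather than on a substring keeps the listing page (`/investments`) out of the result, and the set removes duplicates before sorting.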

Once I have this list, I want to run a for loop with the same code as above. So:

for link in links:
    l = urllib.request.urlopen(link)
    tree = BeautifulSoup(l, 'lxml')

    # title
    title = tree.find_all('div', {'class': 'cc-pitch__title'})
    data['title'].append(title[0].find('h2').get_text())

    # description
    description = tree.find_all('div', {'class': 'fullwidth'})
    data['description'].append(description[1].find('p').get_text())

    # target
    target = tree.find_all('div', {'class': 'cc-pitch__stats clearfix'})
    data['target'].append(target[0].find('dd').get_text())

    # raised
    raised = tree.find_all('div', {'class': 'cc-pitch__raised'})
    data['raised'].append(raised[0].find('b').get_text())

    # category
    category = tree.find_all('li', {'class': 'sectors'})
    data['category'].append(category[0].find('span').get_text())

data

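This does not work. I tried everything, even just inspecting the tree created on the first iteration, and it is empty.

Could the problem be related to the fact that these links are strings?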

1 Answer:

Answer 0: (score: 0)

There are more than three links on the page you linked to; I get 292. If you want to parse each one, do the following:

import requests
from bs4 import BeautifulSoup

url = "https://www.crowdcube.com/investments?sort_by=0&q=&hof=1&i1=0&i2=0&i3=0&i4=0&sort_by=7"


def parse(so):
    # Use the soup passed in as `so`, not the global `soup`.
    return {'title': so.title.text,
            'description': so.find("div", {"class": "pitch-tabs"}).p.text,
            'target': so.find("div", {"class": "cc-pitch__stats clearfix"}).dd.text,
            'raised': so.find("div", {"class": "cc-pitch__raised"}).b.text,
            # Collapse runs of whitespace inside the category text.
            'category': " ".join(so.find("li", {"class": "sectors"}).span.text.split())}


req = requests.get(url)

# Name the parser explicitly; 'lxml' matches the one used in the question.
soup = BeautifulSoup(req.content, "lxml")

# A set comprehension drops duplicate links.
links = {h.a["href"] for h in soup.find_all("h2", {"class": "pitch__title"})}

for link in links:
    print(link)
    soup = BeautifulSoup(requests.get(link).content, "lxml")
    print(parse(soup))

Output snippet:

https://www.crowdcube.com/investment/property-moose-14045
{'category': u'Other, Internet Business, Technology', 'raised': u'\xa3169,010', 'target': u'\xa360,000', 'description': u'Property Moose is a new generation of property investment \u2013 taking the equity crowdfunding model and using it to allow users to invest in a wide range of properties from only \xa3500. Combining this with a fully integrated online platform, Property Moose aspires to take the Crowdfunding revolution by storm.', 'title': u'Property Moose raising \xa360,000 investment on Crowdcube. Capital At Risk.'}
https://www.crowdcube.com/investment/easyproperty-com-16655
{'category': u'Professional and Business Services, Internet Business', 'raised': u'\xa31,358,680', 'target': u'\xa31,000,000', 'description': u'easyProperty, the latest company from easyGroup, will offer individually priced property services. The venture, which has been founded by Sir Stelios (founder of easyJet) and Robert Ellice (a property entrepreneur with 20 years\u2019 experience), has been described by the FT as \u201ceasily the biggest brand name yet to enter the online estate agent business\u201d.', 'title': u'easyProperty.com raising \xa31,000,000 investment on Crowdcube. Capital At Risk.'}
https://www.crowdcube.com/investment/universal-fuels-phase-1-10466
{'category': u'Oil & Gas', 'raised': u'\xa3100,000', 'target': u'\xa3100,000', 'description': u'Universal Fuels Ltd is just over 2 years old, we supply diesel, petrol, lubricants and kerosene UK wide to homes, petrol stations, transport companies, construction firms and a range of other businesses. We have just been\u2026', 'title': u'Universal Fuels Phase 1 raising \xa3100,000 investment on Crowdcube. Capital At Risk.'}
https://www.crowdcube.com/investment/stakis-daycare-nurseries-ltd-12468
{'category': u'Education, Other', 'raised': u'\xa3101,230', 'target': u'\xa3100,000', 'description': u'Stakis Daycare Nurseries is a new franchise provider of daycare nurseries in the UK.', 'title': u'Stakis Daycare Nurseries Ltd raising \xa3100,000 investment on Crowdcube. Capital At Risk.'}
https://www.crowdcube.com/investment/bidstack-20749
{'category': u'Media and Creative Services, Internet Business, Technology', 'raised': u'\xa3138,970', 'target': u'\xa3100,000', 'description': u"Bidstack is a live bidding platform for last-minute digital advertising signage, aiming to make digital out of home advertising truly accessible for anyone. Bidstack launched their video at the O2 arena, raising brand awareness as the first steps to disrupt a growing \xa3multi-billion industry. The team's experience includes a \xa3multi-million business exit and a successfully overfunded Crowdcube campaign.", 'title': u'BidStack raising \xa3100,000 investment on Crowdcube. Capital At Risk.'}
https://www.crowdcube.com/investment/e-sign-14248
{'category': u'Internet Business', 'raised': u'\xa364,760', 'target': u'\xa350,000', 'description': u'E-Sign offers our clients a secure, advanced electronic signature solution to enable important documents to be signed, when required, by any person, anywhere, at any time. Traditional hand written signatures on documents can be expensive, time consuming and provide an opportunity for the signature to be forged. E-Sign allows companies to conclude business more rapidly, whilst reducing their running costs and combating fraud.', 'title': u'E-Sign raising \xa350,000 investment on Crowdcube. Capital At Risk.'}
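
With 292 pages, one pitch that lacks any of these elements would make `parse` raise an AttributeError and stop the loop. The following is a minimal sketch of a more defensive variant that stores None for a missing field instead. The helper names `text_or_none` and `parse_pitch` and the sample HTML are made up for illustration; the class names are the ones used in the code above, and the description field is left out to keep the sketch short. It uses the built-in `html.parser` so that it runs without lxml.

```python
from bs4 import BeautifulSoup


def text_or_none(node):
    """Return the whitespace-normalised text of a tag, or None if it is missing."""
    if node is None:
        return None
    return " ".join(node.get_text().split())


def parse_pitch(soup):
    """Extract pitch fields, leaving None where an element is not found."""

    def first(container_name, container_class, child_name):
        # class_ matches a single CSS class, so "cc-pitch__stats" also
        # matches class="cc-pitch__stats clearfix".
        container = soup.find(container_name, class_=container_class)
        if container is None:
            return None
        return text_or_none(container.find(child_name))

    return {
        "title": text_or_none(soup.title),
        "target": first("div", "cc-pitch__stats", "dd"),
        "raised": first("div", "cc-pitch__raised", "b"),
        "category": first("li", "sectors", "span"),
    }


# Made-up page fragment: it deliberately has no "cc-pitch__raised" block.
SAMPLE_HTML = """
<html><head><title>Demo pitch</title></head><body>
<div class="cc-pitch__stats clearfix"><dl><dt>Target</dt><dd>£60,000</dd></dl></div>
<ul><li class="sectors"><span>Other,
    Technology</span></li></ul>
</body></html>
"""

record = parse_pitch(BeautifulSoup(SAMPLE_HTML, "html.parser"))
print(record)
```

Rows with None values can then be filtered or inspected after the crawl, rather than losing the whole run to a single unusual page.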