Scraping multiple URLs on Kickstarter with BeautifulSoup

Time: 2018-05-08 19:56:04

Tags: python html web-scraping beautifulsoup

I am trying to collect data from the Kickstarter website on the number of backers for each reward tier of each project. My input is a list of URLs.

I am running into several problems:

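  1. I cannot load the URLs from a txt file; my code only works when the list is defined inside the Python script.

    An example of the data I am trying to scrape looks like this (it is a piece of what the code below returned for the 2 links):

    ..... ..... , Pledge CA$ 500 or more About €325 , Pledge $2,500 or more About €1,625 , 5 backers , 2 backers , ........ .......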

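  2. I need to write the results shown above to a CSV file, one row per project. The first cell of each row would be the project link (or the title, if it is scraped with BeautifulSoup); the second column should be a series of values containing the number of backers for each reward. An example using the data above:

    "project link" , "pledge__backer-count" 5 backers , "pledge__amount" $500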

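    I am struggling with the part of the code that handles the list of URLs. The first part was copied from an example on the web and works well. Thanks in advance for your help, I really need this for my thesis.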
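
For problem 1, a minimal sketch of reading the URLs from a text file, assuming the file holds one URL per line. The function name `load_urls` and the file name `urls.txt` are made up for illustration. One possible cause of the failure is that each line read from a file still carries its trailing newline, so the sketch strips it and skips blank lines.

```python
def load_urls(path):
    """Return the non-empty, non-comment lines of a text file as a list of URLs."""
    urls = []
    # utf-8-sig also drops a byte-order mark that some editors prepend
    with open(path, encoding="utf-8-sig") as handle:
        for line in handle:
            url = line.strip()  # removes the trailing newline and stray spaces
            # skip blank lines and lines used as comments (assumed convention)
            if url and not url.startswith("#"):
                urls.append(url)
    return urls
```

With this in place, `urls = load_urls("urls.txt")` could replace the hard-coded list, and the rest of the loop would stay the same.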

    from requests import get
    from requests.exceptions import RequestException
    from contextlib import closing
    from bs4 import BeautifulSoup
    import re
    
    def simple_get(url):
        """
        Attempts to get the content at `url` by making an HTTP GET request.
        If the content-type of response is some kind of HTML/XML, return the
        text content, otherwise return None
        """
        try:
            with closing(get(url, stream=True)) as resp:
                if is_good_response(resp):
                    return resp.content
                else:
                    return None
    
        except RequestException as e:
            log_error('Error during requests to {0} : {1}'.format(url, str(e)))
            return None
    
    def is_good_response(resp):
        """
        Returns true if the response seems to be HTML, false otherwise
        """
        # .get() avoids a KeyError when the server sends no Content-Type header
        content_type = resp.headers.get('Content-Type', '').lower()
        return (resp.status_code == 200
                and 'html' in content_type)
    
    def log_error(e):
        """
        It is always a good idea to log errors. 
        This function just prints them, but you can
        make it do anything.
        """
        print(e)
    
    
    
    urls = [
        'https://www.kickstarter.com/projects/socialismmovie/socialism-an-american-story?ref=home_potd',
        'https://www.kickstarter.com/projects/1653847368/the-cuban-a-film-about-the-power-of-music-over-alz?ref=home_new_and_noteworthy',
    ]
    
    for url in urls:
        Project_raw = simple_get(url)
        if Project_raw is None:
            # simple_get returns None on a failed or non-HTML response
            continue
        Project_bs4 = BeautifulSoup(Project_raw, 'lxml')
    
        # Reward tiers and the backer count shown for each tier
        Backers_offers = Project_bs4.find_all("h2", class_="pledge__amount")
        Backers_per_offer = Project_bs4.find_all("span", class_="pledge__backer-count")
        Offers_plus_Backers = Backers_offers + Backers_per_offer
    
        print(Offers_plus_Backers)
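
For problem 2, a rough sketch of how the scraped values could be written as one CSV row per project, using only the standard `csv` module. The helper names (`clean`, `project_row`, `write_projects`) are hypothetical, and the pairing assumes that the reward amounts and the backer counts appear in the same order on the page, which I have not verified for every project.

```python
import csv


def clean(text):
    """Collapse runs of whitespace and newlines into single spaces."""
    return " ".join(text.split())


def project_row(url, amounts, backer_counts):
    """Build one CSV row: the project link, then amount/backers pairs."""
    row = [url]
    # zip pairs the n-th reward with the n-th backer count and stops at the
    # shorter list, so both lists are assumed to be in the same page order.
    for amount, backers in zip(amounts, backer_counts):
        row.append(clean(amount))
        row.append(clean(backers))
    return row


def write_projects(path, rows):
    """Write one row per project; newline='' avoids blank lines on Windows."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
```

Inside the loop, the text of each tag could be collected with `[tag.get_text() for tag in Backers_offers]` (and likewise for `Backers_per_offer`), passed to `project_row(url, ...)`, and the resulting rows appended to a list that is handed to `write_projects` once the loop has finished.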
    

0 Answers:

No answers