How do I get hrefs from within hrefs?

Date: 2018-11-23 09:54:50

Tags: python web-scraping beautifulsoup

How can I get hrefs from within hrefs using Python, with a class-and-method structure? I tried:

import requests
from bs4 import BeautifulSoup

root_url = 'https://www.iea.org'

class IEAData:
    def __init__(self):
        try:
            ...
        except:
            ...

    def get_links(self, url):
        all_links = []
        page = requests.get(root_url)
        soup = BeautifulSoup(page.text, 'html.parser')
        for href in soup.find_all(class_='omrlist'):
            all_links.append(root_url + href.find('a').get('href'))
        return all_links
        #print(all_links)

iea_obj = IEAData()
yearLinks = iea_obj.get_links(root_url + '/oilmarketreport/reports/')

reportLinks = []

for url in yearLinks:
    links = iea_obj.get_links(yearLinks)
    print(links)

Expected: the links variable should contain the hrefs for all the months, but it isn't getting them, so please tell me what to do.

2 answers:

Answer 0 (score: 0)

There are a couple of problems with your code: your get_links() function does not use the url passed to it, and when looping over the returned links you pass yearLinks rather than url.

The following should get you going:

from bs4 import BeautifulSoup                        
import requests

root_url = 'https://www.iea.org'

class IEAData:
    def get_links(self, url):
        all_links = []
        page = requests.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')

        for li in soup.find_all(class_='omrlist'):
            all_links.append(root_url + li.find('a').get('href'))
        return all_links

iea_obj = IEAData()
yearLinks = iea_obj.get_links(root_url + '/oilmarketreport/reports/')

for url in yearLinks:
    links = iea_obj.get_links(url)
    print(url, links)

That should get you started, with output like:

https://www.iea.org/oilmarketreport/reports/2018/ ['https://www.iea.org/oilmarketreport/reports/2018/0118/', 'https://www.iea.org/oilmarketreport/reports/2018/0218/', 'https://www.iea.org/oilmarketreport/reports/2018/0318/', 'https://www.iea.org/oilmarketreport/reports/2018/0418/', 'https://www.iea.org/oilmarketreport/reports/2018/0518/', 'https://www.iea.org/oilmarketreport/reports/2018/0618/', 'https://www.iea.org/oilmarketreport/reports/2018/0718/', 'https://www.iea.org/oilmarketreport/reports/2018/0818/', 'https://www.iea.org/oilmarketreport/reports/2018/1018/']
https://www.iea.org/oilmarketreport/reports/2017/ ['https://www.iea.org/oilmarketreport/reports/2017/0117/', 'https://www.iea.org/oilmarketreport/reports/2017/0217/', 'https://www.iea.org/oilmarketreport/reports/2017/0317/', 'https://www.iea.org/oilmarketreport/reports/2017/0417/', 'https://www.iea.org/oilmarketreport/reports/2017/0517/', 'https://www.iea.org/oilmarketreport/reports/2017/0617/', 'https://www.iea.org/oilmarketreport/reports/2017/0717/', 'https://www.iea.org/oilmarketreport/reports/2017/0817/', 'https://www.iea.org/oilmarketreport/reports/2017/0917/', 'https://www.iea.org/oilmarketreport/reports/2017/1017/', 'https://www.iea.org/oilmarketreport/reports/2017/1117/', 'https://www.iea.org/oilmarketreport/reports/2017/1217/']
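The parsing step can be checked offline on a small HTML snippet. The markup below is a made-up illustration of the structure get_links() assumes (elements with class "omrlist", each wrapping an <a> with a relative href), not the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the assumed page structure
html = '''
<ul>
  <li class="omrlist"><a href="/oilmarketreport/reports/2018/0118/">January 2018</a></li>
  <li class="omrlist"><a href="/oilmarketreport/reports/2018/0218/">February 2018</a></li>
</ul>
'''

root_url = 'https://www.iea.org'
soup = BeautifulSoup(html, 'html.parser')

# Same extraction logic as get_links(), minus the network call
links = [root_url + li.find('a').get('href')
         for li in soup.find_all(class_='omrlist')]
print(links)
```

Running the real code against the live site depends on the page still using the omrlist class, so a snippet test like this is a quick way to isolate parsing bugs from network or markup changes.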

Answer 1 (score: 0)

I'm fairly new to programming myself, and I'm still learning how classes and everything else work together. But give it a shot (that's how we learn, right?).

Not sure if this is the output you're looking for. I changed 2 things and was able to get all the links from yearLinks into a list. Note that it will also include the PDF links along with the month links I believe you want. If you don't want the PDF links and only the months, exclude the pdfs.

Here's the code I used; maybe you can adapt it to your structure.

import bs4
import requests

root_url = 'https://www.iea.org'


class IEAData:

    def get_links(self, url):
        all_links = []
        page = requests.get(url)
        soup = bs4.BeautifulSoup(page.text, 'html.parser')
        for href in soup.find_all(class_='omrlist'):
            all_links.append(root_url + href.find('a').get('href'))
        return all_links
        #print(all_links)


iea_obj = IEAData()
yearLinks = iea_obj.get_links(root_url + '/oilmarketreport/reports/')

reportLinks = []

for url in yearLinks:
    links = iea_obj.get_links(url)

    # uncomment line below if you do not want the .pdf links
    #links = [ x for x in links if ".pdf" not in x ]
    reportLinks += links
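The commented-out .pdf filter in the loop above is just a substring test on each URL. A standalone sketch of that step (with made-up URLs for illustration) looks like this:

```python
# Hypothetical link list mixing month pages and PDF downloads
links = [
    'https://www.iea.org/oilmarketreport/reports/2018/0118/',
    'https://www.iea.org/oilmarketreport/reports/2018/0118/report.pdf',
    'https://www.iea.org/oilmarketreport/reports/2018/0218/',
]

# Keep only URLs that do not contain ".pdf"
month_links = [x for x in links if '.pdf' not in x]
print(month_links)
```

This drops the PDF entry and keeps the two month-page URLs. If the site ever served URLs where ".pdf" appears mid-path, a stricter check such as `x.endswith('.pdf')` would be safer.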