Web scraping with Python BeautifulSoup

Time: 2018-08-03 03:03:40

Tags: python python-3.x web-scraping beautifulsoup

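I'm just a beginner in Python.

I'm trying to scrape data from a website and have managed to write the code below.

However, I can't retrieve the href attributes, so I can't visit each listing and collect its data, and I'm not sure how to proceed. I also don't know much about HTML tags, so I suspect I haven't identified them correctly.

Here is my code: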

import requests 
from bs4 import BeautifulSoup

urls = []
for i in range(1,5):
    pages = "https://directory.singaporefintech.org/?p={0}&category=0&zoom=15&is_mile=0&directory_radius=0&view=list&hide_searchbox=0&hide_nav=0&hide_nav_views=0&hide_pager=0&featured_only=0&feature=1&perpage=20&sort=random".format(i)
    urls.append(pages)

Data = []
for info in urls:
    page = requests.get(info)
    soup = BeautifulSoup(page.content, 'html.parser')
    links = soup.find_all('a', attrs ={'class' :'sabai-directory-title'})
    hrefs = [link['href'] for link in links]

The code above produces an empty list for hrefs. Any help would be greatly appreciated!

Thanks!

3 answers:

Answer 0: (score: 0)

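The code is fine; the class you are looking for simply doesn't exist on those pages. For example, after inspecting https://directory.singaporefintech.org/hello-world/?category=0&zoom=15&is_mile=0&directory_radius=0&view=list&hide_searchbox=0&hide_nav=0&hide_nav_views=0&hide_pager=0&featured_only=0&feature=1&perpage=20&sort=random, I replaced the sabai-directory-title class with comment-reply-link, added a print statement, and got results.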
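
One way to check which class names a page actually puts on a given tag is to count them before filtering. The following is a minimal sketch, not the answerer's code: SAMPLE_HTML is a made-up stand-in for a fetched page, count_classes is a hypothetical helper, and the real markup may differ.

```python
from collections import Counter
from bs4 import BeautifulSoup

# Hypothetical stand-in for a fetched page; the real markup may differ.
SAMPLE_HTML = """
<div class="sabai-directory-title"><a href="/directory/listing/example-one">Example One</a></div>
<div class="comment"><a class="comment-reply-link" href="/hello-world/?replytocom=1">Reply</a></div>
"""

def count_classes(html, tag_name):
    """Count how often each class name appears on tags named tag_name."""
    soup = BeautifulSoup(html, "html.parser")
    counts = Counter()
    for tag in soup.find_all(tag_name):
        # "class" is a multi-valued attribute, so bs4 returns it as a list.
        counts.update(tag.get("class", []))
    return counts

anchor_classes = count_classes(SAMPLE_HTML, "a")
div_classes = count_classes(SAMPLE_HTML, "div")
print(anchor_classes)
print(div_classes)
```

If the class you filter on never shows up in the count for that tag, find_all will return an empty list for it, which matches the symptom in the question.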

Answer 1: (score: 0)

Hi, I made a few changes to the code:

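import requests
from bs4 import BeautifulSoup
from pprint import pprint

# Note: every iteration appends the same base URL, so the page is
# requested four times and each link appears four times in the output.
urls = []
for i in range(1, 5):
    pages = "https://directory.singaporefintech.org"
    urls.append(pages)

Data = []
for info in urls:
    page = requests.get(info)
    soup = BeautifulSoup(page.content, 'html.parser')
    # The class is on the <div> that wraps each link, not on the <a> itself.
    links = soup.find_all('div', attrs={'class': 'sabai-directory-title'})
    for link in links:
        # Keep the href of every <a> with non-empty text inside the title <div>.
        Data.extend([a['href'] for a in link.find_all('a', href=True) if a.text])
pprint(Data)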

Output:

     ['https://directory.singaporefintech.org/directory/listing/silent-eight',
     'https://directory.singaporefintech.org/directory/listing/moolahsense',
     'https://directory.singaporefintech.org/directory/listing/myfinb',
     'https://directory.singaporefintech.org/directory/listing/wefinance',
     'https://directory.singaporefintech.org/directory/listing/quber',
     'https://directory.singaporefintech.org/directory/listing/ayondo-asia-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/ceo-1',
     'https://directory.singaporefintech.org/directory/listing/acekards',
     'https://directory.singaporefintech.org/directory/listing/paper-ink-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/alpha-payments-cloud',
     'https://directory.singaporefintech.org/directory/listing/samurai-fintech-singapore-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/corris-asset-management-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/fundmylife',
     'https://directory.singaporefintech.org/directory/listing/mooments',
     'https://directory.singaporefintech.org/directory/listing/venture-capital-network-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/junotele_',
     'https://directory.singaporefintech.org/directory/listing/mobilecover',
     'https://directory.singaporefintech.org/directory/listing/cherrypay',
     'https://directory.singaporefintech.org/directory/listing/toast',
     'https://directory.singaporefintech.org/directory/listing/cashdab',
     'https://directory.singaporefintech.org/directory/listing/silent-eight',
     'https://directory.singaporefintech.org/directory/listing/moolahsense',
     'https://directory.singaporefintech.org/directory/listing/myfinb',
     'https://directory.singaporefintech.org/directory/listing/wefinance',
     'https://directory.singaporefintech.org/directory/listing/quber',
     'https://directory.singaporefintech.org/directory/listing/ayondo-asia-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/ceo-1',
     'https://directory.singaporefintech.org/directory/listing/acekards',
     'https://directory.singaporefintech.org/directory/listing/paper-ink-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/alpha-payments-cloud',
     'https://directory.singaporefintech.org/directory/listing/samurai-fintech-singapore-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/corris-asset-management-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/fundmylife',
     'https://directory.singaporefintech.org/directory/listing/mooments',
     'https://directory.singaporefintech.org/directory/listing/venture-capital-network-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/junotele_',
     'https://directory.singaporefintech.org/directory/listing/mobilecover',
     'https://directory.singaporefintech.org/directory/listing/cherrypay',
     'https://directory.singaporefintech.org/directory/listing/toast',
     'https://directory.singaporefintech.org/directory/listing/cashdab',
     'https://directory.singaporefintech.org/directory/listing/silent-eight',
     'https://directory.singaporefintech.org/directory/listing/moolahsense',
     'https://directory.singaporefintech.org/directory/listing/myfinb',
     'https://directory.singaporefintech.org/directory/listing/wefinance',
     'https://directory.singaporefintech.org/directory/listing/quber',
     'https://directory.singaporefintech.org/directory/listing/ayondo-asia-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/ceo-1',
     'https://directory.singaporefintech.org/directory/listing/acekards',
     'https://directory.singaporefintech.org/directory/listing/paper-ink-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/alpha-payments-cloud',
     'https://directory.singaporefintech.org/directory/listing/samurai-fintech-singapore-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/corris-asset-management-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/fundmylife',
     'https://directory.singaporefintech.org/directory/listing/mooments',
     'https://directory.singaporefintech.org/directory/listing/venture-capital-network-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/junotele_',
     'https://directory.singaporefintech.org/directory/listing/mobilecover',
     'https://directory.singaporefintech.org/directory/listing/cherrypay',
     'https://directory.singaporefintech.org/directory/listing/toast',
     'https://directory.singaporefintech.org/directory/listing/cashdab',
     'https://directory.singaporefintech.org/directory/listing/silent-eight',
     'https://directory.singaporefintech.org/directory/listing/moolahsense',
     'https://directory.singaporefintech.org/directory/listing/myfinb',
     'https://directory.singaporefintech.org/directory/listing/wefinance',
     'https://directory.singaporefintech.org/directory/listing/quber',
     'https://directory.singaporefintech.org/directory/listing/ayondo-asia-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/ceo-1',
     'https://directory.singaporefintech.org/directory/listing/acekards',
     'https://directory.singaporefintech.org/directory/listing/paper-ink-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/alpha-payments-cloud',
     'https://directory.singaporefintech.org/directory/listing/samurai-fintech-singapore-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/corris-asset-management-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/fundmylife',
     'https://directory.singaporefintech.org/directory/listing/mooments',
     'https://directory.singaporefintech.org/directory/listing/venture-capital-network-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/junotele_',
     'https://directory.singaporefintech.org/directory/listing/mobilecover',
     'https://directory.singaporefintech.org/directory/listing/cherrypay',
     'https://directory.singaporefintech.org/directory/listing/toast',
     'https://directory.singaporefintech.org/directory/listing/cashdab']

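Is this the output you were expecting?

Hope this helps!

Answer 2: (score: 0)

You can use a CSS selector to scrape the links. The selector div.sabai-directory-title a matches any <a> tag inside a <div> with the class sabai-directory-title (I updated the URL; the one you gave returned an error page):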
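
Because the loop above requests the same base URL on every iteration, each link shows up four times in the output. If that repetition isn't wanted, one option is to drop duplicates while keeping the first-seen order. This is only a sketch: unique_in_order is a hypothetical helper, and the list is a shortened, illustrative sample of the output above.

```python
def unique_in_order(items):
    """Return items with duplicates removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(items))

# Shortened, illustrative sample of the repeated output above.
collected = [
    "https://directory.singaporefintech.org/directory/listing/silent-eight",
    "https://directory.singaporefintech.org/directory/listing/moolahsense",
    "https://directory.singaporefintech.org/directory/listing/silent-eight",
    "https://directory.singaporefintech.org/directory/listing/moolahsense",
]

unique_links = unique_in_order(collected)
print(unique_links)
```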

from bs4 import BeautifulSoup
import requests
from pprint import pprint

r = requests.get('https://directory.singaporefintech.org/')
soup = BeautifulSoup(r.text, 'lxml')

hrefs = [a['href'] for a in soup.select('div.sabai-directory-title a')]

pprint(hrefs)

This prints:

['https://directory.singaporefintech.org/directory/listing/silent-eight',
 'https://directory.singaporefintech.org/directory/listing/incomlend',
 'https://directory.singaporefintech.org/directory/listing/bizgrow',
 'https://directory.singaporefintech.org/directory/listing/makerscut',
 'https://directory.singaporefintech.org/directory/listing/soho-fintech',
 'https://directory.singaporefintech.org/directory/listing/dxmarkets',
 'https://directory.singaporefintech.org/directory/listing/fundrevo',
 'https://directory.singaporefintech.org/directory/listing/money4money',
 'https://directory.singaporefintech.org/directory/listing/onelyst',
 'https://directory.singaporefintech.org/directory/listing/hearti-lab',
 'https://directory.singaporefintech.org/directory/listing/samurai-fintech-singapore-pte-ltd',
 'https://directory.singaporefintech.org/directory/listing/ceo-1',
 'https://directory.singaporefintech.org/directory/listing/arcadier',
 'https://directory.singaporefintech.org/directory/listing/plmp-fintech-pte-ltd',
 'https://directory.singaporefintech.org/directory/listing/cash-in-asia',
 'https://directory.singaporefintech.org/directory/listing/grc-systems',
 'https://directory.singaporefintech.org/directory/listing/sendexpense',
 'https://directory.singaporefintech.org/directory/listing/jinjerjade',
 'https://directory.singaporefintech.org/directory/listing/hatcher',
 'https://directory.singaporefintech.org/directory/listing/fintech-consortium']
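
To see why the descendant selector works where filtering <a> tags by class does not, the two approaches can be compared without a network request. The sketch below assumes the structure described in this answer (the class on a wrapping <div>, the link nested inside); SAMPLE_HTML is hypothetical markup, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the structure described above.
SAMPLE_HTML = """
<div class="sabai-directory-title">
  <a href="https://directory.singaporefintech.org/directory/listing/silent-eight">Silent Eight</a>
</div>
<div class="sabai-directory-title">
  <a href="https://directory.singaporefintech.org/directory/listing/incomlend">Incomlend</a>
</div>
<div class="other"><a href="https://example.com/unrelated">Unrelated</a></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The class sits on the <div>, so filtering <a> tags by it finds nothing.
direct = soup.find_all("a", class_="sabai-directory-title")

# The descendant selector matches <a> tags nested inside the matching <div>;
# [href] skips any anchor that has no href attribute.
nested = [a["href"] for a in soup.select("div.sabai-directory-title a[href]")]

print(direct)
print(nested)
```

The [href] guard is an addition for safety; it avoids a KeyError if a matching anchor has no href.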