Scraping data from dt/dd tags that contain links

Time: 2017-07-05 15:40:14

Tags: python python-3.x web-scraping beautifulsoup

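I want to scrape data from the website "https://www.crunchbase.com/organization/ani-technologies#/entity". My data sits inside dt and dd tags, and bots are not allowed on the site. So I saved the page and applied the BeautifulSoup module to the saved copy, although I mention the actual URL in my code.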

from bs4 import BeautifulSoup

# Live page (bots are blocked there, so it is not fetched here):
# https://www.crunchbase.com/organization/ani-technologies#/entity
# Parse the locally saved copy of that page instead.
with open(r"C:\Users\acer\Desktop\pythonbooks\tam.html") as f:
    soup = BeautifulSoup(f, "html.parser")

dl_data = soup.find_all("dd")
for ctr, dlitem in enumerate(dl_data):
    print(ctr, dlitem.string)
  

Actual output:

0 3 Acquisitions
1 None
2 Bengaluru, Karnataka
3 Ola is a mobile app for cab booking in India.
4 None
5 None
6 olacab link
7 None
8 December 3, 2010
9 ANI Technologies Pvt Ltd, Olacabs.com, Ola Cabs, Olacabs
10 media@olacabs.com
11 None

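Because some of the content is wrapped in hyperlinks, I get None in several places. On the page "https://www.crunchbase.com/organization/ani-technologies#/entity", the "Categories" field lists 5 categories: E-Commerce, Internet, Transportation, Apps and Mobile. Each one is a hyperlink, so I cannot get the text I want, namely these 5 categories.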

  

The output I want:

0 3 Acquisitions
1 (All that text (though not important to me))
2 Bengaluru, Karnataka
3 Ola is a mobile app for cab booking in India.
4 (all that text(though not important to me))
==>5 (E-Commerce, Internet, Transportation, Apps, Mobile)(Extremely important)
6 olacab link
7 (all that text(though not important to me))
8 December 3, 2010
9 ANI Technologies Pvt Ltd, Olacabs.com, Ola Cabs, Olacabs
10 media@olacabs.com
11 (all that text(though not important to me))

It would be most helpful if I could get a dictionary like this:

{"Headquarters":["Bengaluru,Karnataka"],
 "Description":["Ola is a mobile app for cab booking in India."],
 "Category": ["E-Commerce", "Internet", "Transportation", "Apps", "Mobile"]}

1 Answer:

Answer 0 (score: 0)

  

Question: ... I cannot get the text I want ... if I could get a dictionary ...

Get the text/href from every <dd><a href=...>text</a></dd> and collect the results into a dict, for example:

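from collections import OrderedDict

# `soup` is the BeautifulSoup object built from the saved page
os_dict = OrderedDict()

for div_class in ['definition-list-container', 'details definition-list']:
    divs = soup.find_all("div", class_=div_class)
    key = '?'  # placeholder until the first <dt> is seen
    for div in divs:
        for child in div.findChildren():
            if child.name == 'dt':
                # Drop the last character of the label (usually the trailing ':')
                key = child.text[:-1]
            if child.name == 'dd':
                if child.select('a[href]'):
                    a_list = child.find_all("a")
                    if key in ['Social:']:
                        # Social links: keep the URLs, not the link text
                        os_dict[key] = [a['href'] for a in a_list]
                    elif len(a_list) == 1:
                        # A single link: store its text as a plain string
                        os_dict[key] = a_list[0].text
                    else:
                        # Several links: store the text of each one
                        os_dict[key] = [a.text for a in a_list]
                else:
                    # No links: store the text of the <dd> itself
                    os_dict[key] = child.text

for n, key in enumerate(os_dict, 1):
    print('{:>2}: {:>20}:\t{}'.format(n, key, os_dict[key]))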
  

Output

 1:          Acquisition:   3 Acquisitions
 2:  Total Equity Fundin:   ['11 Rounds', '24 Investors']
 3:         Headquarters:   Bengaluru, Karnataka
 4:          Description:   Ola is a mobile app for cab booking in India.
 5:             Founders:   ['Bhavish Aggarwal', 'Ankit Bhati']
 6:           Categories:   ['E-Commerce', 'Internet', 'Transportation', 'Apps', 'Mobile']
 7:              Website:   http://www.olacabs.com
 8:              Social::   ['http://www.facebook.com/olacabs', 'http://twitter.com/olacabs', 'http://www.linkedin.com/company/olacabs-com']
 9:              Founded:   December 3, 2010
10:              Aliases:   ANI Technologies Pvt Ltd, Olacabs.com, Ola Cabs, Olacabs
11:              Contact:   media@olacabs.com
12:            Employees:   8 in Crunchbase
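
Two keys in this output lose their last letter (`Acquisition`, `Total Equity Fundin`) because the loop slices every label with `[:-1]`, whether or not it ends in a colon. A minimal sketch of an alternative, assuming each `dt` is followed by its `dd` as a sibling: strip only a trailing colon, and always store a list, which matches the dictionary shape asked for in the question. The inline HTML and the helper name `dl_to_dict` are made up for illustration and are not taken from the Crunchbase page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the saved page.
SAMPLE_HTML = """
<dl>
  <dt>Headquarters:</dt>
  <dd>Bengaluru, Karnataka</dd>
  <dt>Categories:</dt>
  <dd><a href="/category/e-commerce">E-Commerce</a>, <a href="/category/internet">Internet</a></dd>
  <dt>Acquisitions</dt>
  <dd><a href="/acquisitions">3 Acquisitions</a></dd>
</dl>
"""


def dl_to_dict(soup):
    """Map each <dt> label to a list of values taken from the next <dd>."""
    result = {}
    for dt in soup.find_all("dt"):
        dd = dt.find_next_sibling("dd")
        if dd is None:
            continue  # label without a value
        # Remove a trailing colon only if one is present.
        key = dt.get_text(strip=True).rstrip(":")
        links = dd.find_all("a")
        if links:
            # One list entry per hyperlink text.
            result[key] = [a.get_text(strip=True) for a in links]
        else:
            # No links: wrap the plain text in a one-element list.
            result[key] = [dd.get_text(strip=True)]
    return result


details = dl_to_dict(BeautifulSoup(SAMPLE_HTML, "html.parser"))
print(details)
```

If a `dt` on the real page has no `dd` of its own, `find_next_sibling("dd")` would pick up a later one, so the pairing assumption should be checked against the saved HTML first.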
  

Beautiful Soup documentation: find-all
  Signature: find_all(name, attrs, recursive, string, limit, **kwargs)

dl_data = soup.find_all("dd")
for n, dlitem in enumerate(dl_data, 1):
    if dlitem.select('a[href]'):
        a_text = [a.text for a in dlitem.find_all("a")]
        print('{}: {}'.format(n, a_text))
    else:
        print('{}: {}'.format(n, dlitem.text))
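
The `None` entries in the question come from `.string`. It returns a value only when a tag has exactly one child (a single string, or a single tag that itself has a `.string`), and `None` otherwise. `.text` concatenates all nested strings instead. A small sketch with made-up HTML, not taken from the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one plain <dd> and one <dd> holding two links.
html = (
    "<dd>Bengaluru, Karnataka</dd>"
    '<dd><a href="/c/apps">Apps</a>, <a href="/c/mobile">Mobile</a></dd>'
)
soup = BeautifulSoup(html, "html.parser")
plain, linked = soup.find_all("dd")

print(plain.string)   # a single text child, so .string returns it
print(linked.string)  # several children, so .string is None
print(linked.text)    # all nested strings joined together

# Text of each link on its own, as in the loop above.
link_texts = [a.text for a in linked.find_all("a")]
print(link_texts)
```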

Tested with Python: 3.4.2 - bs4: 4.6.0