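I want to scrape data from the website "https://www.crunchbase.com/organization/ani-technologies#/entity". The data I need sits inside dt and dd tags. Because the site does not allow bots, I saved the page and ran the BeautifulSoup module on the saved copy, even though my code mentions the actual URL.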
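import requests  # only needed for the live fetch shown in the comment below
from bs4 import BeautifulSoup

# The site blocks bots, so parse the locally saved copy of the page.
with open(r"C:\Users\acer\Desktop\pythonbooks\tam.html") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# Fetching the live page would look like this instead. requests.get() returns a
# Response object, so BeautifulSoup needs its .text rather than the Response itself:
# response = requests.get("https://www.crunchbase.com/organization/ani-technologies#/entity")
# soup = BeautifulSoup(response.text, "html.parser")

ctr = 0  # start at 0 so the numbering matches the output below
dl_data = soup.find_all("dd")
for dlitem in dl_data:
    print(ctr, dlitem.string)
    ctr += 1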
Actual output:
0 3 Acquisitions
1 None
2 Bengaluru, Karnataka
3 Ola is a mobile app for cab booking in India.
4 None
5 None
6 olacab link
7 None
8 December 3, 2010
9 ANI Technologies Pvt Ltd, Olacabs.com, Ola Cabs, Olacabs
10 media@olacabs.com
11 None
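I get None in several places because the content there is wrapped in hyperlinks. On the page "https://www.crunchbase.com/organization/ani-technologies#/entity", the "Categories" entry lists 5 categories: E-Commerce, Internet, Transportation, Apps and Mobile. Each one is a hyperlink, so I cannot get the text I want, namely those 5 categories.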
The output I want is:
0 3 Acquisitions
1 (all that text (though not important to me))
2 Bengaluru, Karnataka
3 Ola is a mobile app for cab booking in India.
4 (all that text (though not important to me))
==> 5 (E-Commerce, Internet, Transportation, Apps, Mobile) (extremely important)
6 olacab link
7 (all that text (though not important to me))
8 December 3, 2010
9 ANI Technologies Pvt Ltd, Olacabs.com, Ola Cabs, Olacabs
10 media@olacabs.com
11 (all that text (though not important to me))
It would be most helpful if I could get a dictionary like this:
{"Headquarters":["Bengaluru,Karnataka"],
"Description":["Ola is a mobile app for cab booking in India."],
"Category": ["E-Commerce", "Internet", "Transportation", "Apps", "Mobile"]}
Answer 0 (score: 0)
Question: ... I cannot get the text I want ... if I could get a dictionary ...
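Some background on the None values in the question: Beautiful Soup defines `.string` as None whenever a tag contains more than one child, which is the case for a dd that holds several links. The snippet below is a minimal sketch of that behaviour. The HTML is a made-up fragment that only imitates the dt/dd layout described in the question, and the href values are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the <dt>/<dd> pairs from the question.
html = """
<dl>
  <dt>Headquarters:</dt>
  <dd>Bengaluru, Karnataka</dd>
  <dt>Categories:</dt>
  <dd><a href="/category/e-commerce">E-Commerce</a>, <a href="/category/internet">Internet</a></dd>
</dl>
"""

soup = BeautifulSoup(html, "html.parser")
plain_dd, linked_dd = soup.find_all("dd")

# A <dd> with a single text child: .string returns that text.
print(plain_dd.string)        # Bengaluru, Karnataka

# A <dd> with several children (two <a> tags and the ", " between them):
# .string is ambiguous, so Beautiful Soup returns None.
print(linked_dd.string)       # None

# get_text() joins the text of all descendants instead.
print(linked_dd.get_text())   # E-Commerce, Internet

# Reading each link separately gives a clean list of values.
print([a.get_text() for a in linked_dd.find_all("a")])
```

A dd that wraps exactly one link still yields text through `.string`, which would explain why the single-link website row in the question printed a value while the multi-link rows printed None.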
Get the text or href from every <dd><a href=...>text</a></dd> and collect the results in a dict, for example:
from collections import OrderedDict

# `soup` is the BeautifulSoup object built from the saved page.
os_dict = OrderedDict()
for div_class in ['definition-list-container', 'details definition-list']:
    divs = soup.find_all("div", class_=div_class)
    key = '?'  # fallback key if a <dd> shows up before any <dt>
    for div in divs:
        for child in div.findChildren():
            if child.name == 'dt':
                # Drop the last character of the <dt> text (normally the trailing ':').
                key = child.text[:-1]
            if child.name == 'dd':
                if child.select('a[href]'):
                    a_list = child.find_all("a")
                    if key in ['Social:']:
                        # For social links the URL matters, not the link text.
                        os_dict[key] = [a['href'] for a in a_list]
                    elif len(a_list) == 1:
                        os_dict[key] = a_list[0].text
                    else:
                        os_dict[key] = [a.text for a in a_list]
                else:
                    # No links inside: keep the plain text of the <dd>.
                    os_dict[key] = child.text

for n, key in enumerate(os_dict, 1):
    print('{:>2}: {:>20}:\t{}'.format(n, key, os_dict[key]))
Output:
 1:          Acquisition:  3 Acquisitions
 2:  Total Equity Fundin:  ['11 Rounds', '24 Investors']
 3:         Headquarters:  Bengaluru, Karnataka
 4:          Description:  Ola is a mobile app for cab booking in India.
 5:             Founders:  ['Bhavish Aggarwal', 'Ankit Bhati']
 6:           Categories:  ['E-Commerce', 'Internet', 'Transportation', 'Apps', 'Mobile']
 7:              Website:  http://www.olacabs.com
 8:              Social::  ['http://www.facebook.com/olacabs', 'http://twitter.com/olacabs', 'http://www.linkedin.com/company/olacabs-com']
 9:              Founded:  December 3, 2010
10:              Aliases:  ANI Technologies Pvt Ltd, Olacabs.com, Ola Cabs, Olacabs
11:              Contact:  media@olacabs.com
12:            Employees:  8 in Crunchbase
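In the output above, keys whose dt text has no trailing colon lose their last letter ("Acquisition", "Total Equity Fundin"), because `[:-1]` always drops one character. The following is only a sketch of one way around that, and it also returns every value as a list, as the question asked. It pairs each dt with its dd through `find_next_sibling` instead of walking all children, which is a different technique from the loop above. The function name `definition_list_to_dict`, the sample HTML and the href values are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the page's <dt>/<dd> pairs.
html = """
<dl>
  <dt>Headquarters:</dt>
  <dd>Bengaluru, Karnataka</dd>
  <dt>Description:</dt>
  <dd>Ola is a mobile app for cab booking in India.</dd>
  <dt>Categories:</dt>
  <dd><a href="/c/e-commerce">E-Commerce</a>, <a href="/c/internet">Internet</a>, <a href="/c/transportation">Transportation</a></dd>
</dl>
"""


def definition_list_to_dict(soup):
    """Map each <dt> label to a list of values taken from the <dd> that follows it."""
    result = {}
    for dt in soup.find_all("dt"):
        dd = dt.find_next_sibling("dd")
        if dd is None:
            continue  # label without a value
        # rstrip(":") removes a trailing colon only if one is present.
        key = dt.get_text(strip=True).rstrip(":")
        links = dd.find_all("a")
        if links:
            # One entry per hyperlink, using the link text.
            result[key] = [a.get_text(strip=True) for a in links]
        else:
            # Plain text value, wrapped in a list for a uniform shape.
            result[key] = [dd.get_text(strip=True)]
    return result


info = definition_list_to_dict(BeautifulSoup(html, "html.parser"))
print(info["Categories"])   # ['E-Commerce', 'Internet', 'Transportation']
```

Whether this works on the real page depends on each dd being a sibling of its dt, which is an assumption here and not something the saved page confirms.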
Beautiful Soup documentation: find_all()
Signature: find_all(name, attrs, recursive, string, limit, **kwargs)
dl_data = soup.find_all("dd")
for n, dlitem in enumerate(dl_data, 1):
    if dlitem.select('a[href]'):
        a_text = [a.text for a in dlitem.find_all("a")]
        print('{}: {}'.format(n, a_text))
    else:
        print('{}: {}'.format(n, dlitem.text))
Tested with Python 3.4.2 and bs4 4.6.0
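To make the `find_all` signature quoted above more concrete, here is a small sketch that exercises the name, keyword-attribute, string, limit and recursive arguments. It was not part of the tested code above. The HTML is a hypothetical fragment built from values that appear in the output, and the link text is invented.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; only the values are borrowed from the output above.
html = """
<div class="details">
  <dl>
    <dt>Founded:</dt><dd>December 3, 2010</dd>
    <dt>Contact:</dt><dd>media@olacabs.com</dd>
    <dt>Website:</dt><dd><a href="http://www.olacabs.com">olacabs.com</a></dd>
  </dl>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# name: match by tag name.
all_dd = soup.find_all("dd")

# limit: stop after the first N matches.
first_two = soup.find_all("dd", limit=2)

# **kwargs: filter on attributes; href=True means "has an href attribute".
links = soup.find_all("a", href=True)

# string: match tags whose .string equals the given text.
founded = soup.find_all("dt", string="Founded:")

# recursive=False: look only at direct children of the tag being searched.
div = soup.find("div", class_="details")
direct_dd = div.find_all("dd", recursive=False)

print(len(all_dd), [d.get_text() for d in first_two])
print(links[0]["href"], founded[0].name, direct_dd)
```

Here `direct_dd` is empty because the dd tags are grandchildren of the div, not direct children, so `recursive=False` excludes them.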