How do I create a dictionary from a single element of a list of bs4.element.Tag objects?

Asked: 2017-12-27 06:24:02

Tags: python html web-scraping beautifulsoup

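I have scraped a web page and saved its <li> elements in a defaultdict named ecg_machines, initialised as ecg_machines['City'] = []. Each list holds the <li> elements for one city, and every element is of type <class 'bs4.element.Tag'>. For example, I have an element ecg_machines['Delhi'][0] =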

<li class="lst lst_cl mft2 img llc " data-catid="" data-city="New Delhi" data-csttypnm="LEADER" data-csttypwgt="149" data-dispid="3374118533" data-glid="4291542" data-mcatid="" data-modref="2" data-state="Delhi" id="LST1"><div class="clz"><p class="pnt"><a class="pnm ldf cur" href="http://www.goodhealthinc.in/digital-ecg-machine.html#digital-3-channel-ecg-machine" target="_blank" title="">Digital 3 Channel ECG Machine</a></p><a class="ribn NP-1 " id="1imgenq"><div class="bg rib"></div><div class="nor_i imwd"><img alt="Digital 3 Channel ECG Machine" data-limg="//4.imimg.com/data4/JC/MY/MY-4291542/digital-3-channel-ecg-machine-500x500.jpg" id="bu1" src="//4.imimg.com/data4/JC/MY/MY-4291542/digital-3-channel-ecg-machine-250x250.jpg"/></div></a><div class="lstw"><div class="ldc"><p class="desc des_p" id="trimmed_desc1"><b>Application</b>:  Resting &amp; Diagnostic<br><b>Brand</b>:  Allied Medical<br><b>Types</b>:  3-Lead,  12-Lead<br><b>Operation Mode</b>:  Portable<br><b>Channel</b> <a class="wh mlin" href="http://www.goodhealthinc.in/digital-ecg-machine.html#digital-3-channel-ecg-machine" target="_blank"> more..</a></br></br></br></br></p><div class="prc NP-1" id="1prcenq">Rs 31,500/<span class="quan"> Piece</span></div></div></div></div><div class="spro"><div class="nsim"><span><img alt="Digital ECG Machine" data-modref="2" data-prc="Rs 31,500/ Unit" data-slimg="//4.imimg.com/data4/PI/DK/MY-4291542/digital-ecg-machine-500x500.jpg" id="rp1_1" src="//4.imimg.com/data4/PI/DK/MY-4291542/digital-ecg-machine-125x125.jpg" title="Digital ECG Machine"/></span></div><div class="nsim"><span><img alt="Automatic Digital ECG Machine" data-modref="2" data-prc="Rs 68,000/ Unit" data-slimg="//4.imimg.com/data4/PC/RT/MY-4291542/automatic-digital-ecg-machine-500x500.jpg" id="rp1_2" src="//4.imimg.com/data4/PC/RT/MY-4291542/automatic-digital-ecg-machine-125x125.jpg" title="Automatic Digital ECG Machine"/></span></div><div class="nsim"><span><img alt="Digital Twelve Channel ECG Machine" 
data-modref="2" data-prc="Rs 68,000/ Piece" data-slimg="//4.imimg.com/data4/TA/NW/MY-4291542/digital-twelve-channel-ecg-machine-500x500.jpg" id="rp1_3" src="//4.imimg.com/data4/TA/NW/MY-4291542/digital-twelve-channel-ecg-machine-125x125.jpg" title="Digital Twelve Channel ECG Machine"/></span></div></div><div class="nes"><span class="cnm cmcl"><span><a class="lcname" href="http://www.goodhealthinc.in/" target="_blank">Goodhealth Inc.</a></span><span class="vcom"></span><span class="td t_v cur"><span class="bg t_se" data-val="goodhealth"><span class="ttl wd2 doff" id="tool-tip_n1"><span class="twrw"><span></span></span><span class="t_ts"> </span></span></span></span></span><div class="clg" data-rlocation="Jhandewalan">Jhandewalan, New Delhi<span class="srad cty-t" id="citytt1"><span>201, D. D. A. Commercial Complex, C. M. 1 Jhandewalan Extension,<br> </br></span><span class="ct_l">New Delhi </span><span> - </span><span>110055</span><span>, </span><span>Delhi</span><span><br/></span></span></div></div></li>

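which reads like

Digital 3 Channel ECG Machine

Application: Resting & Diagnostic Brand: Allied Medical Types: 3-Lead, 12-Lead Operation Mode: Portable Channel more..

Rs 31,500/ PieceGoodhealth Inc. Jhandewalan, New Delhi 201, D. D. A. Commercial Complex, C. M. 1 Jhandewalan Extension, New Delhi - 110055, Delhi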

How can I turn this element into a dictionary, so that ecg_machines['Delhi'][0] = {'Name':'Digital 3 Channel ECG Machine', 'Application': 'Resting & Diagnostic', 'Brand': 'Allied Medical', 'Types': '3-Lead, 12-Lead', 'Operation Mode': 'Portable', 'Price':'Rs 31,500/ Piece'...}?

1 Answer:

Answer 0 (score: 0)

You have to use find(), find_all(), select(), select_one() (etc.) to get at the elements inside a bs4.element.Tag:

from bs4 import BeautifulSoup

html = '<li class="lst lst_cl mft2 img llc " data-catid="" data-city="New Delhi" data-csttypnm="LEADER" data-csttypwgt="149" data-dispid="3374118533" data-glid="4291542" data-mcatid="" data-modref="2" data-state="Delhi" id="LST1"><div class="clz"><p class="pnt"><a class="pnm ldf cur" href="http://www.goodhealthinc.in/digital-ecg-machine.html#digital-3-channel-ecg-machine" target="_blank" title="">Digital 3 Channel ECG Machine</a></p><a class="ribn NP-1 " id="1imgenq"><div class="bg rib"></div><div class="nor_i imwd"><img alt="Digital 3 Channel ECG Machine" data-limg="//4.imimg.com/data4/JC/MY/MY-4291542/digital-3-channel-ecg-machine-500x500.jpg" id="bu1" src="//4.imimg.com/data4/JC/MY/MY-4291542/digital-3-channel-ecg-machine-250x250.jpg"/></div></a><div class="lstw"><div class="ldc"><p class="desc des_p" id="trimmed_desc1"><b>Application</b>:  Resting &amp; Diagnostic<br><b>Brand</b>:  Allied Medical<br><b>Types</b>:  3-Lead,  12-Lead<br><b>Operation Mode</b>:  Portable<br><b>Channel</b> <a class="wh mlin" href="http://www.goodhealthinc.in/digital-ecg-machine.html#digital-3-channel-ecg-machine" target="_blank"> more..</a></br></br></br></br></p><div class="prc NP-1" id="1prcenq">Rs 31,500/<span class="quan"> Piece</span></div></div></div></div><div class="spro"><div class="nsim"><span><img alt="Digital ECG Machine" data-modref="2" data-prc="Rs 31,500/ Unit" data-slimg="//4.imimg.com/data4/PI/DK/MY-4291542/digital-ecg-machine-500x500.jpg" id="rp1_1" src="//4.imimg.com/data4/PI/DK/MY-4291542/digital-ecg-machine-125x125.jpg" title="Digital ECG Machine"/></span></div><div class="nsim"><span><img alt="Automatic Digital ECG Machine" data-modref="2" data-prc="Rs 68,000/ Unit" data-slimg="//4.imimg.com/data4/PC/RT/MY-4291542/automatic-digital-ecg-machine-500x500.jpg" id="rp1_2" src="//4.imimg.com/data4/PC/RT/MY-4291542/automatic-digital-ecg-machine-125x125.jpg" title="Automatic Digital ECG Machine"/></span></div><div class="nsim"><span><img alt="Digital Twelve Channel ECG Machine" data-modref="2" data-prc="Rs 68,000/ Piece" data-slimg="//4.imimg.com/data4/TA/NW/MY-4291542/digital-twelve-channel-ecg-machine-500x500.jpg" id="rp1_3" src="//4.imimg.com/data4/TA/NW/MY-4291542/digital-twelve-channel-ecg-machine-125x125.jpg" title="Digital Twelve Channel ECG Machine"/></span></div></div><div class="nes"><span class="cnm cmcl"><span><a class="lcname" href="http://www.goodhealthinc.in/" target="_blank">Goodhealth Inc.</a></span><span class="vcom"></span><span class="td t_v cur"><span class="bg t_se" data-val="goodhealth"><span class="ttl wd2 doff" id="tool-tip_n1"><span class="twrw"><span></span></span><span class="t_ts"> </span></span></span></span></span><div class="clg" data-rlocation="Jhandewalan">Jhandewalan, New Delhi<span class="srad cty-t" id="citytt1"><span>201, D. D. A. Commercial Complex, C. M. 1 Jhandewalan Extension,<br> </br></span><span class="ct_l">New Delhi </span><span> - </span><span>110055</span><span>, </span><span>Delhi</span><span><br/></span></span></div></div></li>'

result = {}

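# Turn each <br> into a newline so every "key: value" pair ends up on its own line
html = html.replace('<br>', '\n')
# Name the parser explicitly; omitting it makes bs4 guess one and emit a warning
soup = BeautifulSoup(html, 'html.parser')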

#print(soup.prettify())

name = soup.find('a', {'class': "pnm ldf cur"}).text
result['Name'] = name
print('Name:', name)
print('---')

items = soup.find('p', {'id': "trimmed_desc1"}).text
items = items.strip().split('\n')
for item in items:
    #print(item)
    # Split on the first colon only, so a value that itself contains ':' stays intact
    parts = item.split(':', 1)
    if len(parts) > 1:
        key = parts[0].strip()
        value = parts[1].strip()
        result[key] = value
        print('key:', key)
        print('value:', value)
        print('---')

price = soup.find('div', {'id': "1prcenq"}).text
result['Price'] = price
print('price:', price)
print('---')

print('result:', result)
print('---')

Result:

Name: Digital 3 Channel ECG Machine
---
key: Application
value: Resting & Diagnostic
---
key: Brand
value: Allied Medical
---
key: Types
value: 3-Lead,  12-Lead
---
key: Operation Mode
value: Portable
---
price: Rs 31,500/ Piece
---
result: {'Name': 'Digital 3 Channel ECG Machine', 'Application': 'Resting & Diagnostic', 'Brand': 'Allied Medical', 'Types': '3-Lead,  12-Lead', 'Operation Mode': 'Portable', 'Price': 'Rs 31,500/ Piece'}
---
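
Since the items in ecg_machines are already bs4.element.Tag objects, the same extraction can probably be done on the Tag directly, without converting it back to a string and replacing <br>. The following is only a sketch under assumptions: the function name tag_to_dict and the trimmed sample markup are hypothetical, and the selectors a.pnm, p.desc and div.prc are taken from the one <li> shown in the question, so other listings on the page may need different ones. It treats each <b> tag as a key and the text node right after it as the value.

```python
from bs4 import BeautifulSoup, NavigableString


def tag_to_dict(li):
    """Build a flat dict from one listing <li> Tag.

    Hypothetical helper; the class names are assumed from the sample markup.
    """
    result = {}

    name = li.select_one('a.pnm')
    if name is not None:
        result['Name'] = name.get_text(strip=True)

    desc = li.select_one('p.desc')
    if desc is not None:
        for label in desc.find_all('b'):
            # The value is the text node that directly follows the <b> label.
            value = label.next_sibling
            if isinstance(value, NavigableString):
                value = value.strip().lstrip(':').strip()
                if value:  # skip labels with no text value, e.g. "Channel"
                    result[label.get_text(strip=True)] = value

    price = li.select_one('div.prc')
    if price is not None:
        result['Price'] = price.get_text().strip()

    return result


# Trimmed-down copy of the <li> from the question, for illustration only.
sample = (
    '<li id="LST1"><p class="pnt"><a class="pnm ldf cur" href="#">'
    'Digital 3 Channel ECG Machine</a></p>'
    '<p class="desc des_p" id="trimmed_desc1">'
    '<b>Application</b>:  Resting &amp; Diagnostic<br>'
    '<b>Brand</b>:  Allied Medical<br>'
    '<b>Types</b>:  3-Lead,  12-Lead<br>'
    '<b>Operation Mode</b>:  Portable<br>'
    '<b>Channel</b> <a class="wh mlin" href="#"> more..</a></p>'
    '<div class="prc NP-1" id="1prcenq">Rs 31,500/'
    '<span class="quan"> Piece</span></div></li>'
)

li = BeautifulSoup(sample, 'html.parser').find('li')
print(tag_to_dict(li))
```

With a helper like this, every city's list could be converted in one pass, for example with [tag_to_dict(li) for li in ecg_machines['Delhi']], assuming all listings share the structure of the sample.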