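I am trying to extract the technical attributes of a product. The product may be electrical, mechanical, or something else. Here is an example of the detail markup for an electrical product, with its technical characteristics and values: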
<section>
<div class="columns">
<div class="column">
<div class="message is-primary">
<header class="message-header">
<h4>Technical Characteristics</h4>
</header>
<div class="message-body">
<dl class="dl-horizontal">
<dt>ELECTRICAL RESISTANCE</dt>
<dd>(AAPP) 3.300 MEGOHMS</dd>
<dt>AMBIENT TEMP IN DEG CELSIUS AT FULL RATED POWER</dt>
<dd>(AAQF) 70.0</dd>
<dt>RESISTANCE TOLERANCE IN PERCENT</dt>
<dd>(AAPQ) -5.000/+5.000</dd><dt>POWER DISSIPATION RATING IN WATTS</dt>
<dd>(AEFB) 0.250 FREE AIR</dd><dt>STYLE DESIGNATOR</dt>
<dd>(TEST) 81349-MIL-R-11/8 SPECIFICATION (INCLUDES ENGINEERINGIONS THAT ARE SHOWN AS "TYPICAL", "AVERAGE", "NOMINAL", ETC.).</dd>
</dl>
</div>
</div>
</div>
</div>
</section>
I can extract the electrical attribute keys and values with this Python script:
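import re
from bs4 import BeautifulSoup
productsoup = BeautifulSoup(productdriver.page_source, "lxml")
for dt in productsoup.find_all("dt", string=re.compile(r"^(ELECTRICAL RESISTANCE|AMBIENT TEMP|RESISTANCE TOLERANCE|POWER DISSIPATION)")):
    # The loop body was not shown in the question; printing the pair is an illustrative stand-in.
    print(dt.get_text(strip=True), dt.find_next_sibling("dd").get_text(strip=True))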
But sometimes a mechanical product may have this format instead:
<section>
<div class="columns">
<div class="column">
<div class="message is-primary">
<header class="message-header">
<h4>Technical Characteristics</h4>
</header>
<div class="message-body">
<dl class="dl-horizontal">
<dt>END ITEM IDENTIFICATION</dt>
<dd>(AGAV) END ITEM 6675014301965</dd><dt>BODY STYLE</dt>
<dd>(AAQL) TUBE TYPE</dd><dt>CONTINUOUS CURRENT RATING IN AMPS</dt>
<dd>(AEBJ) 1.600</dd><dt>III END ITEM IDENTIFICATION</dt>
<dd>(AGAV) END ITEM 6675014301965</dd>
</dl>
</div>
</div>
</div>
</div>
</section>
How can I extract the technical characteristics (dt) and the corresponding values (dd)?
Answer 0 (score: 0)
You could try something like this:
from bs4 import BeautifulSoup
html = """<section>
<div class="columns">
<div class="column">
<div class="message is-primary">
<header class="message-header">
<h4>Technical Characteristics</h4>
</header>
<div class="message-body">
<dl class="dl-horizontal">
<dt>END ITEM IDENTIFICATION</dt>
<dd>(AGAV) END ITEM 6675014301965</dd>
<dt>BODY STYLE</dt>
<dd>(AAQL) TUBE TYPE</dd>
<dt>CONTINUOUS CURRENT RATING IN AMPS</dt>
<dd>(AEBJ) 1.600</dd>
<dt>III END ITEM IDENTIFICATION</dt>
<dd>(AGAV) END ITEM 6675014301965</dd>
</dl>
</div>
</div>
</div>
</div>
</section>"""
soup = BeautifulSoup(html, 'html.parser')
dts = soup.find_all("dt")
outs = {i.string: i.find_next("dd").string for i in dts}
print(outs)
#> {'END ITEM IDENTIFICATION': '(AGAV) END ITEM 6675014301965', 'BODY STYLE': '(AAQL) TUBE TYPE', 'CONTINUOUS CURRENT RATING IN AMPS': '(AEBJ) 1.600', 'III END ITEM IDENTIFICATION': '(AGAV) END ITEM 6675014301965'}
Created on 2018-09-28
import reprexpy
print(reprexpy.SessionInfo())
#> Session info --------------------------------------------------------------------
#> Platform: Darwin-17.7.0-x86_64-i386-64bit (64-bit)
#> Python: 3.6
#> Date: 2018-09-28
#> Packages ------------------------------------------------------------------------
#> beautifulsoup4==4.6.3
#> reprexpy==0.1.1
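A variant of the same idea, in case it helps: the sketch below pairs each dt with its next dd sibling and returns a list of tuples instead of a dict. It assumes two things that the samples in the question do not actually show: that a page might repeat a label (a dict would silently keep only the last value), and that a dd might contain nested markup (in which case `.string` returns None, while `get_text()` still returns the text). The helper name `extract_characteristics` is made up for illustration, and the HTML is a trimmed copy of the electrical sample.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the electrical sample from the question.
html = """
<dl class="dl-horizontal">
<dt>ELECTRICAL RESISTANCE</dt>
<dd>(AAPP) 3.300 MEGOHMS</dd>
<dt>RESISTANCE TOLERANCE IN PERCENT</dt>
<dd>(AAPQ) -5.000/+5.000</dd><dt>POWER DISSIPATION RATING IN WATTS</dt>
<dd>(AEFB) 0.250 FREE AIR</dd>
</dl>
"""


def extract_characteristics(markup):
    """Return (name, value) tuples for the <dt>/<dd> pairs in the markup.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(markup, "html.parser")
    pairs = []
    for dt in soup.find_all("dt"):
        # Look only among the following siblings of this <dt>.
        dd = dt.find_next_sibling("dd")
        if dd is None:
            continue  # no <dd> after this <dt>, so nothing to pair
        # get_text() also works when the tag holds nested markup.
        pairs.append((dt.get_text(strip=True), dd.get_text(strip=True)))
    return pairs


pairs = extract_characteristics(html)
for name, value in pairs:
    print(name, "=>", value)
```

Because nothing here depends on the label text, the same function should work unchanged on the mechanical sample; `dict(pairs)` gives back a dict if the labels are known to be unique.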