How to extract dt and dd values using Python

Time: 2018-09-29 01:12:40

Tags: python html beautifulsoup

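I am trying to extract the technical characteristics of a product. The product may be electrical, mechanical, or something else. Here is an example of the product details for an electrical product, with its technical characteristics and values: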

<section>
    <div class="columns">
        <div class="column">
            <div class="message is-primary">
                <header class="message-header">
                    <h4>Technical Characteristics</h4>
                </header>
                <div class="message-body">
                    <dl class="dl-horizontal">
                        <dt>ELECTRICAL RESISTANCE</dt>
                        <dd>(AAPP) 3.300 MEGOHMS</dd>
                        <dt>AMBIENT TEMP IN DEG CELSIUS AT FULL RATED POWER</dt>
                        <dd>(AAQF) 70.0</dd>
                         <dt>RESISTANCE TOLERANCE IN PERCENT</dt>
                        <dd>(AAPQ) -5.000/+5.000</dd><dt>POWER DISSIPATION RATING IN WATTS</dt>
                        <dd>(AEFB) 0.250 FREE AIR</dd><dt>STYLE DESIGNATOR</dt>
        
                        <dd>(TEST) 81349-MIL-R-11/8 SPECIFICATION (INCLUDES ENGINEERINGIONS THAT ARE SHOWN AS "TYPICAL", "AVERAGE", "NOMINAL", ETC.).</dd>
                    </dl>
                </div>
            </div>
        </div>
    </div>
</section>

I can extract the electrical characteristic keys and values with this Python script:

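import re
from bs4 import BeautifulSoup

# productdriver is the browser driver object created elsewhere in the script
productsoup = BeautifulSoup(productdriver.page_source, "lxml")

pattern = re.compile(r"^(ELECTRICAL RESISTANCE|AMBIENT TEMP|RESISTANCE TOLERANCE|POWER DISSIPATION)")
for dt in productsoup.find_all("dt", string=pattern):
    # each matching <dt> label is followed by the <dd> holding its value
    print(dt.get_text(strip=True), dt.find_next_sibling("dd").get_text(strip=True))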
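One detail about the pattern is worth checking: `^` anchors every alternative to the start of the label, so an alternative only matches when it begins with the label's first word. The following is a small sketch using the labels from the sample HTML above; the variable names `anchored` and `fixed` are illustrative and not part of the original script.

```python
import re

# Labels copied from the electrical sample above.
labels = [
    "ELECTRICAL RESISTANCE",
    "AMBIENT TEMP IN DEG CELSIUS AT FULL RATED POWER",
    "RESISTANCE TOLERANCE IN PERCENT",
    "POWER DISSIPATION RATING IN WATTS",
    "STYLE DESIGNATOR",
]

# "DISSIPATION" is anchored to the start of the string, but the label
# starts with "POWER", so this alternative never matches it.
anchored = re.compile(r"^(ELECTRICAL RESISTANCE|AMBIENT TEMP|RESISTANCE TOLERANCE|DISSIPATION)")
# Starting the alternative with "POWER" lets the label match.
fixed = re.compile(r"^(ELECTRICAL RESISTANCE|AMBIENT TEMP|RESISTANCE TOLERANCE|POWER DISSIPATION)")

missed = [label for label in labels if not anchored.search(label)]
matched = [label for label in labels if fixed.search(label)]
print(missed)
print(matched)
```

A filter like this also has to be extended for every new product type, which is why matching on label text does not carry over to the mechanical layout.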

But sometimes a mechanical product has this format:

<section>
    <div class="columns">
        <div class="column">
            <div class="message is-primary">
                <header class="message-header">
                    <h4>Technical Characteristics</h4>
                </header>
                <div class="message-body">
                    <dl class="dl-horizontal">
                        <dt>END ITEM IDENTIFICATION</dt>
                        <dd>(AGAV) END ITEM 6675014301965</dd><dt>BODY STYLE</dt>
                        <dd>(AAQL) TUBE TYPE</dd><dt>CONTINUOUS CURRENT RATING IN AMPS</dt>
                        <dd>(AEBJ) 1.600</dd><dt>III END ITEM IDENTIFICATION</dt>
                        <dd>(AGAV) END ITEM 6675014301965</dd>
                    </dl>
                </div>
            </div>
        </div>
    </div>
</section>

How can I extract the technical characteristics (dt) and their corresponding values (dd)?

1 answer:

Answer 0 (score: 0)

You can try something like this:

from bs4 import BeautifulSoup

html = """<section>
    <div class="columns">
        <div class="column">
            <div class="message is-primary">
                <header class="message-header">
                    <h4>Technical Characteristics</h4>
                </header>
                <div class="message-body">
                    <dl class="dl-horizontal">
                        <dt>END ITEM IDENTIFICATION</dt>
                        <dd>(AGAV) END ITEM 6675014301965</dd>
                        <dt>BODY STYLE</dt>
                        <dd>(AAQL) TUBE TYPE</dd>
                        <dt>CONTINUOUS CURRENT RATING IN AMPS</dt>
                        <dd>(AEBJ) 1.600</dd>
                        <dt>III END ITEM IDENTIFICATION</dt>
                        <dd>(AGAV) END ITEM 6675014301965</dd>
                    </dl>
                </div>
            </div>
        </div>
    </div>
</section>"""

soup = BeautifulSoup(html, 'html.parser')
dts = soup.find_all("dt")
outs = {i.string: i.find_next("dd").string for i in dts}
print(outs)
#> {'END ITEM IDENTIFICATION': '(AGAV) END ITEM 6675014301965', 'BODY STYLE': '(AAQL) TUBE TYPE', 'CONTINUOUS CURRENT RATING IN AMPS': '(AEBJ) 1.600', 'III END ITEM IDENTIFICATION': '(AGAV) END ITEM 6675014301965'}
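The dictionary comprehension above searches the whole page and relies on `.string`, which returns None when a tag contains nested markup. A slightly more defensive variant might look like the sketch below. It assumes the characteristics always sit in the first `dl` after an `h4` whose text is exactly "Technical Characteristics"; the helper name `extract_characteristics` and the shortened `sample` markup are made up for illustration. Because it does not filter on label text, it should behave the same for the electrical and the mechanical layout.

```python
from bs4 import BeautifulSoup


def extract_characteristics(html):
    """Return (label, value) pairs from the 'Technical Characteristics' list.

    Hypothetical helper: assumes the <dl> follows an <h4> with that heading.
    """
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h4", string="Technical Characteristics")
    if heading is None:
        return []
    dl = heading.find_next("dl")
    if dl is None:
        return []
    pairs = []
    for dt in dl.find_all("dt"):
        # The value is the next <dd> sibling, even when </dd><dt> share a line.
        dd = dt.find_next_sibling("dd")
        if dd is None:
            continue
        pairs.append((dt.get_text(strip=True), dd.get_text(strip=True)))
    return pairs


# Shortened stand-in for the mechanical sample in the question.
sample = """
<section>
  <header><h4>Technical Characteristics</h4></header>
  <dl class="dl-horizontal">
    <dt>BODY STYLE</dt>
    <dd>(AAQL) TUBE TYPE</dd><dt>CONTINUOUS CURRENT RATING IN AMPS</dt>
    <dd>(AEBJ) 1.600</dd>
  </dl>
</section>
"""

for label, value in extract_characteristics(sample):
    print(label, "=>", value)
```

Returning a list of tuples rather than a dict keeps every row even if two `dt` labels happen to be identical, and `get_text(strip=True)` trims the whitespace that the page's indentation leaves around the text.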

Created on 2018-09-28 by the reprexpy package

import reprexpy
print(reprexpy.SessionInfo())
#> Session info --------------------------------------------------------------------
#> Platform: Darwin-17.7.0-x86_64-i386-64bit (64-bit)
#> Python: 3.6
#> Date: 2018-09-28
#> Packages ------------------------------------------------------------------------
#> beautifulsoup4==4.6.3
#> reprexpy==0.1.1