Extracting the key parts of a string where the content is enclosed in <span style="font-weight:bold"> tags

Time: 2019-03-17 22:04:11

Tags: python

I get the following response from a web service:

    <html xmlns="http://www.w3.org/TR/REC-html40">
    <head>
    <title>Grampal </title>
    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
    <meta name="Content-Language" content="EN">
    <meta name="author" content="jmguirao@ugr.es">
    <link rel="icon" type="image/ico" href="/favicon.ico" />
    <style type="text/css">
    html,body,form,ul,li,h1,h3,p{margin:0; padding:0}
    body { font-family: Arial, Helvetica, sans-serif; background-color:#fff }
    a { text-decoration: none; }
    a:hover { text-decoration: underline }
    ul { list-style-type: none }
    td {padding: 0.5pc 2pc 0pc 0pc }
    .nav { float: right; padding: 0.5pc 0.5pc 0.5pc 0.5pc; margin-left:5px }
    .nav li {display:inline; border-left: 1px solid #444; padding:0 0.4em;}
    .nav li.first {border-left:0}
    .hide { display:none }
    input { text-indent: 2px }
    input[type="submit"] { text-indent: 0 }
    DIV.delPage { padding: 0.5ex 5em 0.5em 5em; background-color:#ffd6ba; }
    .delMain { padding: 2ex 0.5em 0.5pc 0.5em; }
    .post { margin-bottom: 0.25pc; font-size: 100%; padding-top: 0.5ex; }
    .posts, #posts { padding: 0.5ex 0.5em 0.5pc 50px; }
    .banner { padding: 0.5ex 0 0.5pc 0.5em; background-color: #ffc6aa;clear: both }
    .banner h1 {
            font-weight: bolder; font-size: 150%;
            margin:0; padding:0 0 0 26px; display: inline;}
    h2 {
         font-weight: bolder; font-size: 140%; color: red;
         margin:0; padding:0 0 0 26px; display: inline;}
    .resaltado {font-weight: bolder;font-size: 100%}        
    </style>

    </head>
    <body>
    <div class="banner">
    <ul class="hide"><li><a href="#content">skip to content</a></li></ul>
    <ul class="nav">Análsis de:
    <li class="first">
    <a title="Analizador morfosintáctico" href="/grampal/grampal.cgi?m=analiza&e=factura">palabras</a></li>
    <li><a title="Desambiguador contextual" href="/grampal/grampal.cgi?m=etiqueta&e=factura">oraciones</a></li>
    <li><a title="Etiquetado de textos" href="/grampal/grampal.cgi?m=xml">textos</a></li>
    <li><a title="Formas de una palabra" href="/grampal/grampal.cgi?m=genera&e=factura">Generación de formas</a></li>
    <!--
    <li><a title="Transcripción fonética" href="/grampal/grampal.cgi?m=transcribe&e=factura">Transcripción</a></li>
    -->
    <li><a href="/grampal/grampal.cgi?m=etiquetario">Etiquetario</a></li>
    <li><a href="/grampal/grampal.cgi?m=autores">Autores</a></li>
    </ul>
    <h1>Grampal</h1>
    </div>



    <div class="delPage" style="font-size: 80%;">
    <form method="GET" action="/grampal/grampal.cgi">
    <input type="hidden" name="m" value="analiza">
    <input type="hidden" name="csrf" value="651c4fcfae059a6e31c39a902f6d27c8">
    <span class="resaltado">Palabra : </span><input name="e" size="60" value="factura">
    <input type="submit" value="Analiza"> &nbsp;

    </form>
    </div>
    <br>
    <h2>factura</h2>



    <div class="delMain">
    <div id="posts">

    <table>

    <tr>

    <td style="font-style:italic;font-size:90%">categoría&nbsp;<span style="font-weight:bold"> N </span></td>

    <td style="font-style:italic;font-size:90%">lema&nbsp;<span style="font-weight:bold"> FACTURA </span></td>

    <td style="font-style:italic;font-size:90%">género&nbsp;<span style="font-weight:bold"> femenino </span></td>

    <td style="font-style:italic;font-size:90%">número&nbsp;<span style="font-weight:bold"> singular </span></td>

    </tr>

    <tr>

    <td style="font-style:italic;font-size:90%">categoría&nbsp;<span style="font-weight:bold"> V </span></td>

    <td style="font-style:italic;font-size:90%">lema&nbsp;<span style="font-weight:bold"> FACTURAR </span></td>

    <td style="font-style:italic;font-size:90%">número&nbsp;<span style="font-weight:bold"> singular </span></td>

    <td style="font-style:italic;font-size:90%">persona&nbsp;<span style="font-weight:bold"> 3 </span></td>

    <td style="font-style:italic;font-size:90%">tiempo&nbsp;<span style="font-weight:bold"> presente indicativo </span></td>

    </tr>

    <tr>

    <td style="font-style:italic;font-size:90%">categoría&nbsp;<span style="font-weight:bold"> V </span></td>

    <td style="font-style:italic;font-size:90%">lema&nbsp;<span style="font-weight:bold"> FACTURAR </span></td>

    <td style="font-style:italic;font-size:90%">número&nbsp;<span style="font-weight:bold"> singular </span></td>

    <td style="font-style:italic;font-size:90%">persona&nbsp;<span style="font-weight:bold"> 2 </span></td>

    <td style="font-style:italic;font-size:90%">tiempo&nbsp;<span style="font-weight:bold"> imperativo </span></td>

    </tr>

    </table>

    </div>

    </div>  
    </body>
    </html>

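But I only want the content inside the <span style="font-weight:bold"> tags. What is the best way to do this? As far as I know, I could only do it with .split, but I don't think that is a very elegant or ideal approach. I'd like to know the best or most elegant way to do it.

This is the output I want: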

[
N,
FACTURA,
femenino,
singular,
.
.
.]

1 answer:

Answer 0 (score: 0)

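You can use a regular expression here. Each captured value keeps the spaces around it (for example " N "), so strip them afterwards to get the output you asked for:

    import re
    # Capture the text inside each bold span, then strip the surrounding spaces.
    matches = re.findall(r'<span style="font-weight:bold">(.*?)<', html_document)
    result = [m.strip() for m in matches]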
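
As an alternative to the regular expression, the same values can be collected with the standard library's `html.parser`, which does not depend on the exact spelling of the opening tag. This is only a minimal sketch, not part of the original answer: the names `BoldSpanExtractor` and `extract_bold` and the shortened `sample` string are made up for illustration, and it assumes the bold spans are flat (not nested inside one another).

```python
from html.parser import HTMLParser


class BoldSpanExtractor(HTMLParser):
    """Collect the text of every <span> whose inline style sets font-weight:bold."""

    def __init__(self):
        super().__init__()
        self.values = []       # stripped text of each matching span
        self._inside = False   # True while reading a matching span
        self._chunks = []      # text pieces of the span being read

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style") or ""
        # Ignore spaces so "font-weight: bold" matches as well.
        if tag == "span" and "font-weight:bold" in style.replace(" ", ""):
            self._inside = True
            self._chunks = []

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._inside:
            self.values.append("".join(self._chunks).strip())
            self._inside = False


def extract_bold(html_document):
    """Return the stripped text of all bold spans in html_document (a str)."""
    parser = BoldSpanExtractor()
    parser.feed(html_document)
    parser.close()
    return parser.values


# Shortened, hypothetical sample modelled on the table rows in the question.
sample = (
    '<td>categoría&nbsp;<span style="font-weight:bold"> N </span></td>'
    '<td>lema&nbsp;<span style="font-weight:bold"> FACTURA </span></td>'
)
print(extract_bold(sample))  # ['N', 'FACTURA']
```

Passing the full response from the question to `extract_bold` should give the values of all three table rows in document order. For anything more involved than this, a dedicated HTML parsing library is probably the more comfortable choice.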