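How to extract specific elements from HTML code using Python

Time: 2018-03-17 15:51:40

Tags: python html web-scraping beautifulsoup

I'm not very confident with HTML, and I'm having trouble using Python to parse this piece of HTML (the output of print soup.prettify()).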



$("#global-flash").html("");

$('#reviews-tab-navigation').trigger('repaint');
$('#edit-review-tab').html('
<div class='\"row-fluid\"'>
 \n
 <div class='\"span3\"'>
  \n
  <div class='\"label' full-height="" id='\"review-search-result-panel\"' use-bootstrap-tables\"="">
   \n
   <span class='\"panel-headline\"'>
    Rezensionsdaten&lt;\/span&gt;\n
    <hr/>
    \n\n
    <table class='\"table' id='\"review-search-result-list\"' table-hover="" table-striped\"="">
     \n
     <thead>
      \n
      <tr>
       \n
       <th>
        \n
        <span class='\"review-count\"'>
         5&lt;\/span&gt;\n\n                Rezensionen gefunden\n            &lt;\/th&gt;\n          &lt;\/tr&gt;\n        &lt;\/thead&gt;\n\n
         <tbody>
          \n
          <tr>
           \n
           <td class='\"selectable-review-entry\"' data-mastertstyle-id='\"\"' data-review-id='\"10613555\"'>
            \n
            <span btn-link="" btn-small="" class='\"btn' review-list-link\"="">
             \n                  5\n
             <img 2015\"="" alt='\"Bewertung' src='\"http://bp-webtools1.otto.boreus.de/tools/images/app/reviews/bewertung_stern_2015.png\"' stern=""/>
             \n\n
             <span aderisce="" anzi="" bene="" colore="" come="" difettucci="" e="" foto,="" i="" in="" morbidissima,="" non="" pelle,="" piacevole="" rotolini.\"="" segnare="" senza="" stringe="" sulla="" title='\"Bel'>
              Bel colore come in foto, morbidissima, piacevole sulla pelle, non stringe anzi aderisce bene senza segnare i difettucci e i rotolini.&lt;\/span&gt;\n                &lt;\/span&gt;\n              &lt;\/td&gt;\n            &lt;\/tr&gt;\n
              <tr>
               \n
               <td class='\"selectable-review-entry\"' data-mastertstyle-id='\"\"' data-review-id='\"10610141\"'>
                \n
                <span btn-link="" btn-small="" class='\"btn' review-list-link\"="">
                 \n                  5\n
                 <img 2015\"="" alt='\"Bewertung' src='\"http://bp-webtools1.otto.boreus.de/tools/images/app/reviews/bewertung_stern_2015.png\"' stern=""/>
                 \n\n
                 <span title='\"bella\"'>
                  bella&lt;\/span&gt;\n                &lt;\/span&gt;\n              &lt;\/td&gt;\n            &lt;\/tr&gt;\n
                  <tr>
                   \n
                   <td class='\"selectable-review-entry\"' data-mastertstyle-id='\"\"' data-review-id='\"10575319\"'>
                    \n
                    <span btn-link="" btn-small="" class='\"btn' review-list-link\"="">
                     \n                  4\n
                     <img 2015\"="" alt='\"Bewertung' src='\"http://bp-webtools1.otto.boreus.de/tools/images/app/reviews/bewertung_stern_2015.png\"' stern=""/>
                     \n\n
                     <span buona="" morbido.\"="" qualità-prezzo,="" rapporto="" tessuto="" title='\"Buon' vestibilità,="">
                      Buon rapporto qualità-prezzo, buona vestibilità, tessuto morbido.&lt;\/span&gt;\n                &lt;\/span&gt;\n              &lt;\/td&gt;\n            &lt;\/tr&gt;\n
                      <tr>
                       \n
                       <td class='\"selectable-review-entry\"' data-mastertstyle-id='\"\"' data-review-id='\"10554514\"'>
                        \n
                        <span btn-link="" btn-small="" class='\"btn' review-list-link\"="">
                         \n                  5\n
                         <img 2015\"="" alt='\"Bewertung' src='\"http://bp-webtools1.otto.boreus.de/tools/images/app/reviews/bewertung_stern_2015.png\"' stern=""/>
                         \n\n
                         <span buon="" capo!="" giusto="" ottima="" peso\"="" qualità,="" title='\"Davvero' un="">
                          Davvero un buon capo! Ottima qualità, giusto peso&lt;\/span&gt;\n                &lt;\/span&gt;\n              &lt;\/td&gt;\n            &lt;\/tr&gt;\n
                          <tr>
                           \n
                           <td class='\"selectable-review-entry\"' data-mastertstyle-id='\"\"' data-review-id='\"9469234\"'>
                            \n
                            <span btn-link="" btn-small="" class='\"btn' review-list-link\"="">
                             \n                  5\n
                             <img 2015\"="" alt='\"Bewertung' src='\"http://bp-webtools1.otto.boreus.de/tools/images/app/reviews/bewertung_stern_2015.png\"' stern=""/>
                             \n\n
                             <span ....="" altri="" anche="" bello="" colori="" e="" funzionale.="" in="" regolare.\"="" taglia="" title='\"Preso'>
                              Preso anche in altri colori .... bello e funzionale. Taglia regolare.&lt;\/span&gt;\n                &lt;\/span&gt;\n              &lt;\/td&gt;\n            &lt;\/tr&gt;\n        &lt;\/tbody&gt;\n      &lt;\/table&gt;\n    &lt;\/div&gt;\n  &lt;\/div&gt;\n\n
                              <div class='\"span9\"'>
                               \n
                               <div class='\"row-fluid\"'>
                                \n
                                <div class='\"span3\"'>
                                 \n
                                 <div class='\"label' full-height\"="" id='\"product-data-panel\"'>
                                  \n
                                  <span class='\"panel-headline\"'>
                                   Informazioni articolo&lt;\/span&gt;\n
                                   <hr/>
                                   \n
                                   <a href='\"https://www.bonprix.it/search.htm?qu=95341195\"' target='\"_blank\"'>
                                    <img src="\'http://image01.bonprix.de/bonprixbilder//assets/114x160/13050022.jpg\'"/>
                                    &lt;\/a&gt;\n
                                    <label>
                                     N. art.&lt;\/label&gt;\n
                                     <a class='\"btn-link\"' href='\"https://www.bonprix.it/search.htm?qu=95341195\"' target='\"_blank\"'>
                                      95341195&lt;\/a&gt;\n
                                      <label>
                                       Masterstyle-ID&lt;\/label&gt;\n52826321\n
                                       <label>
                                        Digistyle-ID&lt;\/label&gt;\n12709620\n
                                        <label>
                                         Ø Media dei voti&lt;\/label&gt;\n4.45
                                         <img 2015\"="" alt='\"Bewertung' src='\"http://bp-webtools1.otto.boreus.de/tools/images/app/reviews/bewertung_stern_2015.png\"' stern="">
                                          \n
                                          <label>
                                           Lunghezza&lt;\/label&gt;\nGiusto\n
                                           <label>
                                            Larghezza&lt;\/label&gt;\nGiusto\n
                                            <label>
                                             Disponibilità&lt;\/label&gt;\n\n(37)\n\n        &lt;\/div&gt;\n      &lt;\/div&gt;\n\n
                                             <div class='\"span5\"'>
                                              \n
                                              <div class='\"label' full-height\"="" id='\"single-review-panel\"'>
                                               \n
                                               <span class='\"panel-headline\"'>
                                                Dati cliente&lt;\/span&gt;\n
                                                <hr/>
                                                \n
                                                <table class='\"customer-info-table\"'>
                                                 \n
                                                 <tr>
                                                  \n
                                                  <td>
                                                   \n
                                                   <label>
                                                    Nome&lt;\/label&gt;\n      nome\n    &lt;\/td&gt;\n
                                                    <td>
                                                     \n
                                                     <label>
                                                      Cognome&lt;\/label&gt;\n      cognome\n    &lt;\/td&gt;\n  &lt;\/tr&gt;\n
                                                      <tr>
                                                       \n
                                                       <td>
                                                        \n
                                                        <label>
                                                         Codice cliente&lt;\/label&gt;\n      N/A\n    &lt;\/td&gt;\n
                                                         <td>
                                                          \n
                                                          <label>
                                                           Indirizzo e-mail&lt;\/label&gt;\n      ********@gmail.com\n    &lt;\/td&gt;\n  &lt;\/tr&gt;\n&lt;\/table&gt;\n\n
                                                           <span class='\"panel-headline\"'>
                                                            Commento articolo&lt;\/span&gt;\n
                                                            <hr/>
                                                            \n
                                                            <i class='\"rating' r5\"="">
                                                             &lt;\/i&gt;
                                                             <br/>
                                                             \n\n
                                                             <textarea id='\"review-text\"' name='\"text\"' readonly='\"readonly\"' rows='\"12\"'>\nBel colore come in foto, morbidissima, piacevole sulla pelle, non stringe anzi aderisce bene senza segnare i difettucci e i rotolini.&lt;\/textarea&gt;\n\n<span class='\"panel-headline\"'>Commenti sulla vestibilità&lt;\/span&gt;\n<hr/>\n<table class='\"size-info-table\"'>\n  <tr>\n    <td>\n      <label>Lunghezza&lt;\/label&gt;\n      Giusto\n    &lt;\/td&gt;\n    <td>\n      <label>Larghezza&lt;\/label&gt;\n      Giusto\n    &lt;\/td&gt;\n    <td>\n      <label>Taglia&lt;\/label&gt;\n      62/64\n    &lt;\/td&gt;\n    <td>\n      <label>Varianti&lt;\/label&gt;\n       \n    &lt;\/td&gt;\n    <td>\n      <label>Statura&lt;\/label&gt;\n      165-169\n    &lt;\/td&gt;\n  &lt;\/tr&gt;\n&lt;\/table&gt;\n<p>\n  <table class='\"table\"'>\n    <tr>\n      <td>\n        <b>Rezensions-ID:&lt;\/b&gt;\n        <span id='\"review-id\"'>10613555&lt;\/span&gt;\n      &lt;\/td&gt;\n      <td>\n        <b>Creata:&lt;\/b&gt;\n        <span class='\"utc-date\"'>\n          01.10.2017 11:06:26\n        &lt;\/span&gt;\n      &lt;\/td&gt;\n    &lt;\/tr&gt;\n    <tr>\n      <td>\n        <b>Letzte Änderung&lt;\/b&gt;\n        <span class='\"utc-date\"'>\n          01.10.2017 11:06:26\n        &lt;\/span&gt;\n      &lt;\/td&gt;\n      <td>\n        <b>di&lt;\/b&gt;\n        Kunde\n      &lt;\/td&gt;\n    &lt;\/tr&gt;\n    <tr>\n      <td>\n        <b>Data pubblicazione:&lt;\/b&gt;\n        <span class='\"utc-date\"'>\n          01.10.2017 11:06:26\n        &lt;\/span&gt;\n      &lt;\/td&gt;\n    &lt;\/tr&gt;\n  &lt;\/table&gt;\n&lt;\/p&gt;\n\n        &lt;\/div&gt;\n      &lt;\/div&gt;\n\n      <div class='\"span4\"'>\n        <div class='\"label' full-height\"="" id='\"editing-functions-panel\"'>\n            <span class='\"panel-headline\"'>Modifica&lt;\/span&gt;\n<hr/>\n<div>\n  <label>Scegli un destinatario&lt;\/label&gt;\n  <a 
class='\"btn-link\"' false;\"="" href='\"#\"' id='\"reset-recipients-list-link\"' onclick='\"reviews.resetRecipientsList(true);' return="">Cancella la lista destinatari&lt;\/a&gt;\n  <select id='\"email-recipients-select\"' name='\"email-recipients-select\"'><option value='\"\"'>&lt;\/option&gt;\n<option value='\"*****@*****.it\"'>servizio@******.it&lt;\/option&gt;&lt;\/select&gt;\n  <textarea id='\"email-recipients-textarea\"' name='\"email-recipients-textarea\"'>\n&lt;\/textarea&gt;\n  <a class='\"btn\"' data-confirm-translation-modified-text='\"Die' false;\"="" gespeichert.="" href='\"#\"' id='\"send-mail-btn\"' nicht="" noch="" onclick='\"reviews.sendMail(true);' return="" rezension="" trotzdem="" versenden?\"="" wurde="" übersetzung="">Invia recensione&lt;\/a&gt;\n  <label>Traduci&lt;\/label&gt;\n  <textarea id='\"review-uebersetzung\"' name='\"text\"'>\n&lt;\/textarea&gt;\n  <label>Feedback al cliente&lt;\/label&gt;\n  <textarea id='\"review-feedbackToCustomer\"' name='\"text\"'>\n&lt;\/textarea&gt;\n&lt;\/div&gt;\n<div>\n  <label>Tipo di recensione&lt;\/label&gt;\n  <select id='\"review-meinungstyp\"' name='\"meinungstyp\"'><option selected='\"selected\"' value='\"R\"'>Recensione&lt;\/option&gt;\n<option value='\"G\"'>Risposte&lt;\/option&gt;\n<option value='\"A\"'>Archivio&lt;\/option&gt;&lt;\/select&gt;\n&lt;\/div&gt;\n<div id='\"aktiv-checkboxes-container\"'>\n  <div class='\"control-group' use-bootstrap-groups\"="">\n    <label class='\"control-label\"' for='\"review_aktiv\"'>Pubblicata&lt;\/label&gt;\n    <input id='\"review_aktiv\"' name='\"review_aktiv\"' type='\"hidden\"' value='\"T\"'/>\n    <div class='\"controls\"'>\n      <div class='\"btn-group\"'>\n        <a btn="" btn-success\"="" class='\"change-active-state' data-value='\"T\"' href='\"#\"'>Sì&lt;\/a&gt;\n        <a \"="" btn="" class='\"change-active-state' data-value='\"F\"' href='\"#\"'>No&lt;\/a&gt;\n      &lt;\/div&gt;\n    &lt;\/div&gt;\n  &lt;\/div&gt;\n&lt;\/div&gt;\n\n<div 
class='\"row-fluid' form-actions="" possible-multi-line\"="">\n  <a btn-primary\"="" class='\"btn' false;\"="" href='\"#\"' id='\"save-review-btn\"' onclick='\"reviews.saveReview(true);' remote='\"true\"' return="">Salva recensione&lt;\/a&gt;\n  <a btn-danger\"="" class='\"btn' data-confirm-dialog-title='\"Cancella' false;\"="" href='\"#\"' id='\"delete-review-btn\"' onclick='\"reviews.deleteSelectedReview(true);' recensioni\"="" remote='\"true\"' return=""><i class="\'icon-trash" icon-white\'="">&lt;\/i&gt; Cancella recensioni&lt;\/a&gt;\n&lt;\/div&gt;\n\n        &lt;\/div&gt;\n      &lt;\/div&gt;\n    &lt;\/div&gt;\n  &lt;\/div&gt;\n&lt;\/div&gt;\n').trigger('repaint');
reviews.initEditReviewTab();
$('#reviews-tab-navigation').tabs('option', 'active', 0);
$('.search-tab-buttons').html('<div class='\"search-tab-buttons\"'>\n  <table>\n    <tr>\n        <td><a btn-primary\"="" class='\"btn' false;\"="" href='\"#\"' onclick='\"reviews.submitSearchReviews();' remote='\"true\"' return="">Cerca&lt;\/a&gt;&lt;\/td&gt;\n      <td><a btn-default\"="" class='\"btn' false;\"="" href='\"#\"' onclick='\"reviews.setDefaultSearchParams();' remote='\"true\"' return="">Ricerca standard&lt;\/a&gt;&lt;\/td&gt;\n      <td><a btn-default\"="" class='\"btn' false;\"="" href='\"#\"' onclick='\"reviews.showStatistics(true);' remote='\"true\"' return="">Statistiche&lt;\/a&gt;&lt;\/td&gt;\n    &lt;\/tr&gt;\n  &lt;\/table&gt;\n&lt;\/div&gt;');
$('.mini-statistics').replaceWith('  <div class='\"mini-statistics\"'>\n    <p>\n      Da controllare: 100 / Pubblicata: 304316 / Non pubblicata: 9207 / Prenotate: [0], mie: [0]\n    &lt;\/p&gt;\n  &lt;\/div&gt;\n');
</p></div></a></td></a></td></a></td></tr></table></div></i></a></a></div></a></a></div></div></label></div></div></option></option></option></select></label></div></textarea></label></textarea></label></a></textarea></option></option></select></a></label></div></span></div></div></span></b></td></tr></b></td></span></b></td></tr></span></b></td></span></b></td></tr></table></p></label></td></label></td></label></td></label></td></label></td></tr></table></span></textarea>
                                                            </i>
                                                           </span>
                                                          </label>
                                                         </td>
                                                        </label>
                                                       </td>
                                                      </tr>
                                                     </label>
                                                    </td>
                                                   </label>
                                                  </td>
                                                 </tr>
                                                </table>
                                               </span>
                                              </div>
                                             </div>
                                            </label>
                                           </label>
                                          </label>
                                         </img>
                                        </label>
                                       </label>
                                      </label>
                                     </a>
                                    </label>
                                   </a>
                                  </span>
                                 </div>
                                </div>
                               </div>
                              </div>
                             </span>
                            </span>
                           </td>
                          </tr>
                         </span>
                        </span>
                       </td>
                      </tr>
                     </span>
                    </span>
                   </td>
                  </tr>
                 </span>
                </span>
               </td>
              </tr>
             </span>
            </span>
           </td>
          </tr>
         </tbody>
        </span>
       </th>
      </tr>
     </thead>
    </table>
   </span>
  </div>
 </div>
</div>

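Basically, I want to extract the number that follows each "data-review-id" (there are 5 in this piece of HTML: 10613555, 10610141, 10575319, 10554514, 9469234), but I don't understand which tags I should select to get that result.

I have tried several combinations of soup.find_all, but without any success.

Any help or suggestion would be much appreciated.

Thanks in advance!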

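1 Answer:

Answer 0: (score: 0)

The HTML you have sits inside some JavaScript and appears to have been escaped. Copying and pasting the exact HTML you gave and assigning it to html, the following works: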

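from bs4 import BeautifulSoup

# Paste the HTML into a normal (non-raw) string, so Python turns \" into "
html = """ ---- add HTML here ---"""

# Drop the leftover double quotes and turn the escaped \/ back into /
html = html.replace('"', '').replace(r'\/', '/')
soup = BeautifulSoup(html, "html.parser")

# Select only the <td> tags that carry a data-review-id attribute
for td in soup.find_all('td', {'data-review-id': True}):
    print(td['data-review-id'])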

This then prints:

10613555
10610141
10575319
10554514
9469234
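
If only the IDs are needed, a regular expression applied directly to the escaped text may be enough. This is a different technique from the BeautifulSoup approach above, not a replacement for it. The following is a minimal sketch, assuming the IDs are always purely numeric. The sample string is a shortened, hypothetical fragment in the same escaped form as the question's data, and the name extract_review_ids is made up for illustration.

```python
import re

# Hypothetical sample: a shortened fragment in the same escaped form as the
# question's data. The raw string keeps every backslash exactly as written.
sample = r"""
<td class='\"selectable-review-entry\"' data-mastertstyle-id='\"\"' data-review-id='\"10613555\"'>
<td class='\"selectable-review-entry\"' data-mastertstyle-id='\"\"' data-review-id='\"9469234\"'>
"""

# After the '=' sign, skip any quotes or backslashes, then capture the digits.
REVIEW_ID = re.compile(r"data-review-id=\W*(\d+)")


def extract_review_ids(text):
    """Return every numeric data-review-id found in text, in document order."""
    return REVIEW_ID.findall(text)


print(extract_review_ids(sample))
```

Because the pattern ignores whatever quoting surrounds the value, it should not matter whether the text has been unescaped first. Regular expressions are fragile on HTML in general, so for anything beyond pulling out one attribute the parser-based approach is probably the safer choice.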