Split a table into several tables with BeautifulSoup [Python]

Time: 2013-08-07 13:30:14

Tags: python-2.7 beautifulsoup

I need your help with a problem I cannot find a solution to...

I have an HTML table made of tr and td elements:

For example:

<table border="0" cellpadding="0" cellspacing="0">
    <tr>
     <td>
     </td>
    </tr>
    <tr>
     <td colspan="2">
      <br />
      <h2>
       Macros
      </h2>
     </td>
    </tr>
    <tr>
     <td>
      #define&nbsp;
     </td>
     <td>
      <a class="el" href="#g3e3da223d2db3b49a9b6e3ee6f49f745">
       SND_LSTINDIC
      </a>
     </td>
    </tr>
    <tr>
     <td class="mdescLeft">
      &nbsp;
     </td>
     <td class="mdescRight">
      liste sons indication
      <br />
     </td>
    </tr>
    <tr>
     <td colspan="2">
      <br />
      <h2>
       Définition de type
      </h2>
     </td>
    </tr>
    <tr>
     <td class="memItemLeft" nowrap="nowrap" align="right" valign="top">
      typedef void(*&nbsp;
     </td>
     <td class="memItemRight" valign="bottom">
      <a class="el" href="#g73cba8bd62d629eb05495a5c1a7b2844">
       f_sndChangeFunc
      </a>
      )(
      <a class="el" href="#g4ab7db37a42f244764583a63997489a8">
       e_sndSound
      </a>
      i_eSound,
    aBOOL
    i_bStart,
    aBYTE
    i_byDisableModule)
     </td>
    </tr>
    <tr>
     <td class="mdescLeft">
      &nbsp;
     </td>
     <td class="mdescRight">
      Fonction rappel sur départ/arrêt bip.
      <a href="#g73cba8bd62d629eb05495a5c1a7b2844">
      </a>
      <br />
     </td>
    </tr>
    <tr>
     <td colspan="2">
      <br />
      <h2>
       Énumérations
      </h2>
     </td>
    </tr>
    <tr>
     <td class="memItemLeft" nowrap="nowrap" align="right" valign="top">
      enum &nbsp;
     </td>
     <td class="memItemRight" valign="bottom">
      <a class="el" href="#g4ab7db37a42f244764583a63997489a8">
       e_sndSound
      </a>
      {
      }
     </td>
    </tr>
    <tr>
     <td class="mdescLeft">
      &nbsp;
     </td>
     <td class="mdescRight">
      identificateurs sons
      <a href="group__Sound.html#g4ab7db37a42f244764583a63997489a8">
       Plus de détails...
      </a>
      <br />
     </td>
    </tr>
</table>

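I am trying to split this table into several tables. I want to pull each heading (the <h2> elements) out of the table and create a new table containing the rows that follow it.

For example, the expected result here would be: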

<h2>
  Macros
</h2>
<table border="0" cellpadding="0" cellspacing="0">
    <tr>
     <td>
     </td>
    </tr>
    <tr>
     <td colspan="2">
      <br />
     </td>
    </tr>
    <tr>
     <td>
      #define&nbsp;
     </td>
     <td>
      <a class="el" href="#g3e3da223d2db3b49a9b6e3ee6f49f745">
       SND_LSTINDIC
      </a>
     </td>
    </tr>
    <tr>
     <td class="mdescLeft">
      &nbsp;
     </td>
     <td class="mdescRight">
      liste sons indication
      <br />
     </td>
    </tr>
  </table>

  <h2>
    Définition de type
  </h2>
  <table>
    <tr>
     <td class="memItemLeft" nowrap="nowrap" align="right" valign="top">
      typedef void(*&nbsp;
     </td>
     <td class="memItemRight" valign="bottom">
      <a class="el" href="#g73cba8bd62d629eb05495a5c1a7b2844">
       f_sndChangeFunc
      </a>
      )(
      <a class="el" href="#g4ab7db37a42f244764583a63997489a8">
       e_sndSound
      </a>
      i_eSound,
    aBOOL
    i_bStart,
    aBYTE
    i_byDisableModule)
     </td>
    </tr>
    <tr>
     <td class="mdescLeft">
      &nbsp;
     </td>
     <td class="mdescRight">
      Fonction rappel sur départ/arrêt bip.
      <a href="#g73cba8bd62d629eb05495a5c1a7b2844">
      </a>
      <br />
     </td>
    </tr>
  </table>

  <h2>
    Énumérations
  </h2>
  <table>
    <tr>
     <td class="memItemLeft" nowrap="nowrap" align="right" valign="top">
      enum &nbsp;
     </td>
     <td class="memItemRight" valign="bottom">
      <a class="el" href="#g4ab7db37a42f244764583a63997489a8">
       e_sndSound
      </a>
      {
      }
     </td>
    </tr>
    <tr>
     <td class="mdescLeft">
      &nbsp;
     </td>
     <td class="mdescRight">
      identificateurs sons
      <a href="group__Sound.html#g4ab7db37a42f244764583a63997489a8">
       Plus de détails...
      </a>
      <br />
     </td>
    </tr>
</table>

I use Python and BeautifulSoup to parse my HTML. I first tried this:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(allHtml) # allHtml holds the HTML source shown above

for table in soup.findAll("table"):
   h2s = table.findAll("h2")
   if h2s: # at least one <h2> in this table ("is not []" would always be true)
      FirstH2 = True
      for h2 in h2s:
         h2.parent.replaceWithChildren() # <td> deleted
         h2.parent.replaceWithChildren() # <tr> deleted
         print h2.parent
         if FirstH2:
            # first heading: only open a new table after it
            h2.replaceWith( h2.prettify() + '<table>' )
            #h2_tag_idx = h2.parent.contents.index(h2) # other method to add Tags
            #h2.parent.insert(h2_tag_idx + 1, '<b>OK</b>')
         else:
            # later headings: close the previous table, then open a new one
            h2.replaceWith( '</table>' + h2.prettify() + '<table>' )

         FirstH2 = False

print soup.prettify()

But it does not work: the tags I insert are replaced by their HTML entity equivalents (&lt; and &gt;)...
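A minimal sketch of why the escaping happens. It assumes the newer bs4 package on Python 3 rather than the BeautifulSoup 3 import used above, and the variable names are illustrative. A plain string handed to the tree is stored as a text node, so its angle brackets are escaped on output; real elements have to be created as Tag objects.

```python
from bs4 import BeautifulSoup

html = "<div><h2>Macros</h2></div>"

# Case 1: replace a tag with a plain string. The string becomes a text
# node, so its angle brackets are escaped when the tree is serialized.
soup = BeautifulSoup(html, "html.parser")
soup.h2.replace_with("<h2>Macros</h2><table>")
escaped = str(soup)

# Case 2: build a real Tag with new_tag() and insert it after the heading.
soup = BeautifulSoup(html, "html.parser")
new_table = soup.new_tag("table")
soup.h2.insert_after(new_table)
structured = str(soup)

print(escaped)     # the inserted markup shows up as &lt;...&gt; text
print(structured)  # the <table> is a genuine element
```

This also suggests why emitting a lone '<table>' or '</table>' cannot work: the tree only holds complete elements, so an opening tag cannot be added separately from its closing tag.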

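I also tried extracting everything inside the table and putting it back into the soup after rebuilding several tables, but that failed...

I also tried getting the table as a string, splitting the string on a delimiter, and feeding all the sub-tables back into the soup, but that failed too...

If anyone has an idea, that would be great!

Thanks in advance!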

1 Answer:

Answer 0 (score: 0)

I wrote this function, and it works...

from BeautifulSoup import BeautifulSoup, Tag

def getOutTitleFromTable(htmlSoup):
   for ii, table in enumerate(htmlSoup.findAll("table")):
      h2s = table.findAll("h2") # find every <h2></h2> inside the table
      if len(h2s) > 0: # at least one <h2> in the table
         newTables = BeautifulSoup() # will hold the rebuilt tables
         for i, h2 in enumerate(h2s):
            h2.parent.replaceWithChildren() # remove the enclosing <td>
            h2.parent.replaceWithChildren() # remove the enclosing <tr>

            idT = "table"+str(ii)+str(i) # table id, so the new table can be found again
            wrapTable = Tag(htmlSoup, "table")
            wrapTable["id"]=idT
            wrapTable["border"]=0
            wrapTable["cellpadding"]=0
            wrapTable["cellspacing"]=0
            # add an empty <table></table> right after each <h2>title</h2>
            table.insert(h2.parent.contents.index(h2)+1, wrapTable)
            newTable = table.find(name="table", attrs={"id" : idT})
            fillTable = False
            for tr in table.findAll(["h2","tr"]):
               if fillTable:
                  if tr in h2s:
                     # next heading reached: the new table is complete
                     break
                  elif tr.find("h2") not in h2s:
                     # move the row into the new table
                     newTable.contents.append(tr)

               if str(tr) == str(h2):
                  # current heading found: the rows after it belong to the new table
                  fillTable = True

            newTables.append(h2)
            newTables.append(newTable)

         table.contents = newTables
         # Turn the outer <table> into a <div>. This is a cheat, but I could
         # not find any way to remove the wrapping <table></table>.
         table.name = "div"

If anyone has a better solution, I would be glad to see it.
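One possible alternative, offered as a sketch rather than a definitive fix. It assumes the bs4 package on Python 3 instead of BeautifulSoup 3, tables that are not nested, and a hypothetical helper name, split_table_on_headings. It builds each new table as a real Tag, moves the rows into it, and then removes the original table, so no wrapper remains and no rename to div is needed. The row that contained the heading is dropped, and rows that appear before the first heading are placed in the first new table.

```python
from bs4 import BeautifulSoup


def split_table_on_headings(soup, heading="h2"):
    """Replace every table that contains headings with heading/table pairs."""
    for table in soup.find_all("table"):
        if table.find(heading) is None:
            continue  # no heading inside: leave this table alone
        pieces = []     # headings and new tables, in document order
        pending = []    # rows seen before the first heading
        current = None  # table currently being filled
        for row in table.find_all("tr"):
            title = row.find(heading)
            if title is not None:
                # A heading row starts a new table; the row itself is dropped.
                current = soup.new_tag("table")
                current.attrs = dict(table.attrs)  # copy border, cellpadding, ...
                for early in pending:
                    current.append(early)
                pending = []
                pieces.extend([title.extract(), current])
            elif current is None:
                pending.append(row.extract())
            else:
                current.append(row.extract())
        for piece in pieces:
            table.insert_before(piece)
        table.decompose()  # the original wrapper is no longer needed
    return soup


html = """
<table border="0">
 <tr><td colspan="2"><h2>Macros</h2></td></tr>
 <tr><td>#define</td><td>SND_LSTINDIC</td></tr>
 <tr><td colspan="2"><h2>Enumerations</h2></td></tr>
 <tr><td>enum</td><td>e_sndSound</td></tr>
</table>
"""
soup = split_table_on_headings(BeautifulSoup(html, "html.parser"))
print(soup.prettify())
```

The rows are collected with find_all before anything is moved, so extracting them inside the loop does not disturb the iteration.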

Bye