I need help with a problem I haven't been able to solve...
I have an HTML table made of tr and td elements.
For example:
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td>
</td>
</tr>
<tr>
<td colspan="2">
<br />
<h2>
Macros
</h2>
</td>
</tr>
<tr>
<td>
#define
</td>
<td>
<a class="el" href="#g3e3da223d2db3b49a9b6e3ee6f49f745">
SND_LSTINDIC
</a>
</td>
</tr>
<tr>
<td class="mdescLeft">
</td>
<td class="mdescRight">
liste sons indication
<br />
</td>
</tr>
<tr>
<td colspan="2">
<br />
<h2>
Définition de type
</h2>
</td>
</tr>
<tr>
<td class="memItemLeft" nowrap="nowrap" align="right" valign="top">
typedef void(*
</td>
<td class="memItemRight" valign="bottom">
<a class="el" href="#g73cba8bd62d629eb05495a5c1a7b2844">
f_sndChangeFunc
</a>
)(
<a class="el" href="#g4ab7db37a42f244764583a63997489a8">
e_sndSound
</a>
i_eSound,
aBOOL
i_bStart,
aBYTE
i_byDisableModule)
</td>
</tr>
<tr>
<td class="mdescLeft">
</td>
<td class="mdescRight">
Fonction rappel sur départ/arrêt bip.
<a href="#g73cba8bd62d629eb05495a5c1a7b2844">
</a>
<br />
</td>
</tr>
<tr>
<td colspan="2">
<br />
<h2>
Énumérations
</h2>
</td>
</tr>
<tr>
<td class="memItemLeft" nowrap="nowrap" align="right" valign="top">
enum
</td>
<td class="memItemRight" valign="bottom">
<a class="el" href="#g4ab7db37a42f244764583a63997489a8">
e_sndSound
</a>
{
}
</td>
</tr>
<tr>
<td class="mdescLeft">
</td>
<td class="mdescRight">
identificateurs sons
<a href="group__Sound.html#g4ab7db37a42f244764583a63997489a8">
Plus de détails...
</a>
<br />
</td>
</tr>
</table>
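I'm trying to split this table into several tables. I want to take each <h2> heading out of the table and put the rows that follow it into a table of their own.
For example, the expected result here would be: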
<h2>
Macros
</h2>
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td>
</td>
</tr>
<tr>
<td colspan="2">
<br />
</td>
</tr>
<tr>
<td>
#define
</td>
<td>
<a class="el" href="#g3e3da223d2db3b49a9b6e3ee6f49f745">
SND_LSTINDIC
</a>
</td>
</tr>
<tr>
<td class="mdescLeft">
</td>
<td class="mdescRight">
liste sons indication
<br />
</td>
</tr>
</table>
<h2>
Définition de type
</h2>
<table>
<tr>
<td class="memItemLeft" nowrap="nowrap" align="right" valign="top">
typedef void(*
</td>
<td class="memItemRight" valign="bottom">
<a class="el" href="#g73cba8bd62d629eb05495a5c1a7b2844">
f_sndChangeFunc
</a>
)(
<a class="el" href="#g4ab7db37a42f244764583a63997489a8">
e_sndSound
</a>
i_eSound,
aBOOL
i_bStart,
aBYTE
i_byDisableModule)
</td>
</tr>
<tr>
<td class="mdescLeft">
</td>
<td class="mdescRight">
Fonction rappel sur départ/arrêt bip.
<a href="#g73cba8bd62d629eb05495a5c1a7b2844">
</a>
<br />
</td>
</tr>
</table>
<h2>
Énumérations
</h2>
<table>
<tr>
<td class="memItemLeft" nowrap="nowrap" align="right" valign="top">
enum
</td>
<td class="memItemRight" valign="bottom">
<a class="el" href="#g4ab7db37a42f244764583a63997489a8">
e_sndSound
</a>
{
}
</td>
</tr>
<tr>
<td class="mdescLeft">
</td>
<td class="mdescRight">
identificateurs sons
<a href="group__Sound.html#g4ab7db37a42f244764583a63997489a8">
Plus de détails...
</a>
<br />
</td>
</tr>
</table>
I'm using Python and BeautifulSoup to parse my HTML. I first tried this:
from BeautifulSoup import BeautifulSoup, NavigableString
import sys
import os

# allHtml holds the HTML source shown above
soup = BeautifulSoup(allHtml)
for table in soup.findAll("table"):
    h2s = table.findAll("h2")
    if h2s:  # at least one <h2> in this table
        FirstH2 = True
        LastH2 = False
        for i, h2 in enumerate(h2s):
            LastH2 = (i == len(h2s) - 1)
            h2.parent.replaceWithChildren()  # <td> deleted
            h2.parent.replaceWithChildren()  # <tr> deleted
            print h2.parent
            if FirstH2:
                h2.replaceWith(h2.prettify() + '<table>')
                #h2_tag_idx = h2.parent.contents.index(h2) # other method to add Tags
                #h2.parent.insert(h2_tag_idx + 1, '<b>OK</b>')
            else:
                h2.replaceWith('</table>' + h2.prettify() + '<table>')
            FirstH2 = False
print soup.prettify()
But it doesn't work: my tags come out replaced by their HTML entity equivalents...
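A likely explanation for the escaping, offered as an assumption rather than a confirmed diagnosis: a plain string handed to the tree is stored as text, not parsed as markup, so the serializer escapes its angle brackets on output. The sketch below illustrates that text-versus-element distinction with the standard library's xml.etree.ElementTree rather than BeautifulSoup (a swapped-in library, chosen only so the example has no third-party dependency); the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Case 1: markup assigned as a text string. The tree stores it as text,
# so the serializer escapes the angle brackets when writing it out.
as_text = ET.fromstring("<td/>")
as_text.text = "<table></table>"
escaped = ET.tostring(as_text, encoding="unicode")

# Case 2: a real element node appended to the tree. It is written as a tag.
as_node = ET.fromstring("<td/>")
ET.SubElement(as_node, "table")
real = ET.tostring(as_node, encoding="unicode")

print(escaped)
print(real)
```

The practical consequence is that new tables have to be created as tag objects and inserted into the tree, not concatenated in as strings.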
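I also tried extracting everything in the table and putting it back into the soup after rebuilding several tables, but that failed...
I also tried getting the table as a string, splitting it on a separator, and putting all the sub-tables back into the soup, but that failed too...
If anyone has an idea, that would be great!
Thanks in advance!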
Answer 0 (score: 0)
I wrote this function, and it works...
from BeautifulSoup import BeautifulSoup, Tag

def getOutTitleFromTable(htmlSoup):
    for ii, table in enumerate(htmlSoup.findAll("table")):
        h2s = table.findAll("h2")  # find every <h2></h2> in the table
        if len(h2s) > 0:  # at least one <h2> in the table
            FirstH2 = True
            LastH2 = False
            newTables = BeautifulSoup()  # will hold the rebuilt tables
            for i, h2 in enumerate(h2s):
                LastH2 = (i == len(h2s) - 1)
                h2.parent.replaceWithChildren()  # remove the <td>
                h2.parent.replaceWithChildren()  # remove the <tr>
                idT = "table" + str(ii) + str(i)  # table id, for readability
                wrapTable = Tag(htmlSoup, "table")
                wrapTable["id"] = idT
                wrapTable["border"] = 0
                wrapTable["cellpadding"] = 0
                wrapTable["cellspacing"] = 0
                # add <table></table> after each <h2>title</h2>
                table.insert(h2.parent.contents.index(h2) + 1, wrapTable)
                newTable = table.find(name="table", attrs={"id": idT})
                fillTable = False
                for tr in table.findAll(["h2", "tr"]):
                    if fillTable:
                        if tr in h2s:
                            # next heading reached: the new table is complete
                            fillTable = False
                            break
                        else:
                            if tr.find("h2") not in h2s:
                                newTable.contents.append(tr)  # add a row
                    if str(tr) == str(h2):
                        fillTable = True  # start of the new table
                newTables.append(h2)
                newTables.append(newTable)
                FirstH2 = False
            table.contents = newTables
            # Rename the table tag to div. It's cheating, but I could not
            # manage to remove the wrapping <table></table>.
            table.name = "div"
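The same row-grouping idea can also be sketched by building a new tree instead of editing the old one in place. The version below is only a sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup, it assumes the markup is well-formed XML (as the sample table appears to be), and the function name split_table_on_h2 and the shortened sample HTML are made up for illustration. Rows that appear before the first heading are dropped, and each new table copies the attributes of the original one.

```python
import xml.etree.ElementTree as ET

def split_table_on_h2(table):
    """Return a <div> holding each <h2> followed by a <table> of its rows."""
    container = ET.Element("div")
    current = None  # table that collects the rows under the current heading
    for row in table.findall("tr"):
        heading = row.find(".//h2")
        if heading is not None:
            # A heading row starts a new group; the row itself is discarded.
            heading.tail = None
            container.append(heading)
            current = ET.SubElement(container, "table", dict(table.attrib))
        elif current is not None:
            # An ordinary row goes into the table of the latest heading.
            current.append(row)
    return container

# Shortened, hypothetical sample in the same shape as the table above.
html = """<table border="0">
<tr><td colspan="2"><br /><h2>Macros</h2></td></tr>
<tr><td>#define</td><td>SND_LSTINDIC</td></tr>
<tr><td colspan="2"><h2>Enumerations</h2></td></tr>
<tr><td>enum</td><td>e_sndSound</td></tr>
</table>"""

result = split_table_on_h2(ET.fromstring(html))
print(ET.tostring(result, encoding="unicode"))
```

Returning a <div> wrapper mirrors the rename-to-div workaround above, since the headings and tables need a single parent element.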
If anyone has a better solution, I'd be glad to see it.
Bye