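Performing BeautifulSoup operations on a list while maintaining structure in Python

Time: 2014-05-27 00:33:10

Tags: python list beautifulsoup

I have a list of Beautiful Soup objects, and I am trying to parse the contents of the cells further. I want the output to be a list of lists, each with 3 items, because the table has 3 columns.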

file = <html><p><center><h1>  Interference Report  </h1></center><p>
<b>  Interference Report Project File:  </b>C:\Users\ksobon\Documents\test_project_03_ksobon.rvt  <br>  <b>  Created:  </b>  Monday, May 26, 2014 7:52:32 PM  <br>  <b>  Last Update:  </b>    <br>
 <p><table border=on>  <tr>  <td></td>  <td ALIGN="center">A</td>  <td  ALIGN="center">B</td>  </tr>
<tr>  <td>  1  </td>  <td>  Workset1 : Walls : Basic Wall : E103-CON 100mm : id 469021     </td>  <td>  Workset1 : Furniture : FUR_BoardroomTable10Chairs_gm : Board Room Layout : id   482259  </td>  </tr>
<tr>  <td>  2  </td>  <td>  Workset1 : Walls : Basic Wall : E103-CON 100mm : id 469021    </td>  <td>  Workset1 : Walls : Basic Wall : E103-CON 100mm : id 483442  </td>  </tr>
<tr>  <td>  3  </td>  <td>  Workset1 : Walls : Basic Wall : E103-CON 100mm : id 469060    </td>  <td>  Workset1 : Furniture : FUR_Sofa_gm : 2100mm : id 475041  </td>  </tr>
<tr>  <td>  4  </td>  <td>  Workset1 : Walls : Basic Wall : E103-CON 100mm : id 469109   </td>  <td>  Workset1 : Furniture : FUR_Sofa_gm : 2100mm : id 475273  </td>  </tr>
<tr>  <td>  5  </td>  <td>  Workset1 : Walls : Basic Wall : E103-CON 100mm : id 469178   </td>  <td>  Workset1 : Furniture : FUR_Sofa_gm : 2100mm : id 475510  </td>  </tr>
<tr>  <td>  6  </td>  <td>  Workset1 : Walls : Basic Wall : E103-CON 100mm : id 469178    </td>  <td>  Workset1 : Furniture : FUR_Sofa_gm : 2100mm : id 482306  </td>  </tr>
<tr>  <td>  7  </td>  <td>  whatever : Doors : DOR_Single_gm : 800w, 2100h (720Leaf) -  Mark 102B : id 472052  </td>  <td>  Workset1 : Windows : WIN-ConceptWindowFixed_gm : 1200 H   x 1200 W - Mark 102B : id 472822  </td>  </tr>
<tr>  <td>  8  </td>  <td>  whatever : Doors : DOR_Single_gm : 800w, 2100h (720Leaf) -  Mark 101A : id 472376  </td>  <td>  Workset1 : Windows : WIN-ConceptWindowFixed_gm : 1200 H   x 1200 W - Mark 101C : id 472720  </td>  </tr>
<tr>  <td>  9  </td>  <td>  Workset1 : Windows : WIN-ConceptWindowFixed_gm : 1800 H x  1200 W 2 - Mark 101B : id 472688  </td>  <td>  Workset1 : Furniture : FUR_Sofa_gm : 2100mm   : id 482306  </td>  </tr>
</table>
<p><b>  End of Interference Report  </b>
</html>
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(file)
tag = soup.findAll('tr')

txt = []
for i in tag:
    txt.append(i.findAll('td'))

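Now I want to convert each sub-list element to text, so I tried:

txt1 = [i.text for x in txt for i in x]

However, my output for txt1 is a flat list rather than a list of lists. What am I doing wrong?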

1 Answer:

Answer 0: (score: 1)

Put i.text in a list (this gives a list of one-item lists, not yet grouped by row):

txt1 = [[i.text] for x in txt for i in x] 

You are using the list comprehension to flatten the list, which extracts all of the elements into a single list.

l = [[1,2],[2,3],[5,6]]

flatten_l = [x for y in l for x in y]
print (flatten_l)
[1, 2, 2, 3, 5, 6]
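
To see why the result is flat, it may help to expand the comprehension into explicit loops: both for clauses feed a single list. A nested comprehension, by contrast, builds one inner list per sublist. A minimal sketch with the same sample data (the names flat_loop and nested, and the multiplication by 10, are only illustrative):

```python
l = [[1, 2], [2, 3], [5, 6]]

# Equivalent of [x for y in l for x in y]: every item lands in one list.
flat_loop = []
for y in l:
    for x in y:
        flat_loop.append(x)

# Nested comprehension: the inner brackets rebuild each sublist,
# so the outer list keeps one entry per original sublist.
nested = [[x * 10 for x in y] for y in l]

print(flat_loop)
print(nested)
```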

Maybe you need map:

l=[[1,2,4],[2,3,5],[5,6,7]]

print [map(str, s) for s in l]

[['1', '2', '4'], ['2', '3', '5'], ['5', '6', '7']]
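
The print statement above is Python 2 syntax. On Python 3, map returns a lazy iterator rather than a list, so each result would need wrapping in list(). A sketch of the equivalent, assuming Python 3 (the name as_strings is illustrative):

```python
l = [[1, 2, 4], [2, 3, 5], [5, 6, 7]]

# list(...) materialises each map object so every sublist is a real list.
as_strings = [list(map(str, s)) for s in l]
print(as_strings)
```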

With your code, this calls i.text on each element while maintaining the structure:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(file)

tag = soup.findAll('tr')
txt=[(i.findAll('td')) for i in tag]
final=[[] for x in range(len(txt))]
for j,k in enumerate(txt):
    for i in k:
        final[j].append(i.text)  

print final
[[u'', u'A', u'B'], [u'1', u'Workset1 : Walls : Basic Wall : E103-CON 100mm : id 469021', u'Workset1 : Furniture : FUR_BoardroomTable10Chairs_gm : Board Room Layout......
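
The loop above could also be written as a nested comprehension, which keeps one inner list per row. The sketch below uses a hypothetical Cell stand-in with a .text attribute in place of real BeautifulSoup tags, and made-up cell values, so that it runs without the library. With the real txt, only the final comprehension would be needed.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed <td> tag; only .text matters here.
Cell = namedtuple("Cell", "text")

# Same shape as txt in the answer: one list of cells per table row.
txt = [
    [Cell(""), Cell("A"), Cell("B")],
    [Cell("1"), Cell("wall"), Cell("table")],
]

# The outer comprehension walks the rows, the inner one walks the
# cells of a single row, so the row grouping is preserved.
final = [[cell.text for cell in row] for row in txt]
print(final)
```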