So I have a table:
<table border="1" style="width: 100%">
<caption></caption>
<col>
<col>
<tbody>
<tr>
<td>Pig</td>
<td>House Type</td>
</tr>
<tr>
<td>Pig A</td>
<td>Straw</td>
</tr>
<tr>
<td>Pig B</td>
<td>Stick</td>
</tr>
<tr>
<td>Pig C</td>
<td>Brick</td>
</tr>
</tbody>
</table>
I'm just trying to return a JSON string of the table's pairs, like this:
[["Pig A", "Straw"], ["Pig B", "Stick"], ["Pig C", "Brick"]]
However, with my code I can't seem to get rid of the HTML tags:
stable = soup.find('table')
cells = []
rows = stable.findAll('tr')
for tr in rows[1:4]:
    # Process the body of the table
    row = []
    td = tr.findAll('td')
    #td = [el.text for el in soup.tr.findAll('td')]
    row.append(td[0])
    row.append(td[1])
    cells.append(row)
return cells
# Eventually, I want to do this:
# h = json.dumps(cells)
# return h
My output is:
[[<td>Pig A</td>, <td>Straw</td>], [<td>Pig B</td>, <td>Stick</td>], [<td>Pig C</td>, <td>Brick</td>]]
Answer 0 (score: 2)
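Each td[i] is a Tag object, so appending it keeps the markup. Use the .text attribute to get only the element's inner text: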
row.append(td[0].text)
row.append(td[1].text)
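Putting that fix together with the json.dumps step from the question, a minimal sketch might look like the following. The function name table_to_json, the skip_header parameter, and the HTML constant are made up for illustration, and the built-in "html.parser" backend is an assumption, since the question does not say which parser it uses.

```python
import json
from bs4 import BeautifulSoup

# Sample markup from the question (closing tags added for completeness).
HTML = """
<table border="1" style="width: 100%">
<caption></caption>
<col>
<col>
<tbody>
<tr><td>Pig</td><td>House Type</td></tr>
<tr><td>Pig A</td><td>Straw</td></tr>
<tr><td>Pig B</td><td>Stick</td></tr>
<tr><td>Pig C</td><td>Brick</td></tr>
</tbody>
</table>
"""

def table_to_json(html, skip_header=True):
    """Return the first table's cell text as a JSON string (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table").find_all("tr")
    if skip_header:
        rows = rows[1:]  # drop the "Pig / House Type" header row
    # get_text(strip=True) returns only the text, without the <td> markup
    cells = [[td.get_text(strip=True) for td in tr.find_all("td")] for tr in rows]
    return json.dumps(cells)

print(table_to_json(HTML))
```

Slicing with rows[1:] instead of rows[1:4] keeps the sketch working if the table grows beyond three pigs.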
Answer 1 (score: 0)
You can try using the lxml library.
import lxml.html as PARSER

# You can read the markup from an HTML file:
# data = open('example.html').read()
# or define it inline:
data = """
<table border="1" style="width: 100%">
<caption></caption>
<col>
<col>
<tbody>
<tr>
<td>Pig</td>
<td>House Type</td>
</tr>
<tr>
<td>Pig A</td>
<td>Straw</td>
</tr>
<tr>
<td>Pig B</td>
<td>Stick</td>
</tr>
<tr>
<td>Pig C</td>
<td>Brick</td>
</tr>
</tbody>
</table>
"""
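root = PARSER.fromstring(data)
main_list = []
# iter("tr") visits only the rows; it replaces the deprecated getiterator()
for tr in root.iter("tr"):
    # Read each cell on its own so the result does not depend on newlines between tags
    main_list.append([td.text_content().strip() for td in tr.iter("td")])
print(main_list)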
Output: [['Pig', 'House Type'], ['Pig A', 'Straw'], ['Pig B', 'Stick'], ['Pig C', 'Brick']]
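Note that this output still includes the header row and is a Python list rather than a JSON string. To get the exact string the question asks for, one option is to drop the first row and pass the rest to json.dumps. This is only a sketch: the helper name rows_from_table and the shortened HTML constant are assumptions made for illustration.

```python
import json
import lxml.html

# Shortened copy of the table from the question.
HTML = """
<table border="1" style="width: 100%">
<tbody>
<tr><td>Pig</td><td>House Type</td></tr>
<tr><td>Pig A</td><td>Straw</td></tr>
<tr><td>Pig B</td><td>Stick</td></tr>
<tr><td>Pig C</td><td>Brick</td></tr>
</tbody>
</table>
"""

def rows_from_table(html):
    """Return every row as a list of stripped cell strings (hypothetical helper)."""
    root = lxml.html.fromstring(html)
    return [
        [td.text_content().strip() for td in tr.iter("td")]
        for tr in root.iter("tr")
    ]

rows = rows_from_table(HTML)
body = rows[1:]              # skip the "Pig / House Type" header row
result = json.dumps(body)    # serialize the remaining pairs as JSON
print(result)
```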