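Below is my code for extracting data from an HTML document and putting it into variables. I need to exclude the blank rows as well as the "Total" row. I have added the HTML input for those segments below the code. I can't figure out how to make this work. I can't use len(), because the length varies. Any help?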
from bs4 import BeautifulSoup
import urllib
import re
import HTMLParser
html = urllib.urlopen('RanpakAllocations.html').read()
parser = HTMLParser.HTMLParser()
#unescape doesn't seem to work
output = parser.unescape(html)
soup1 = BeautifulSoup(output, "html.parser")
Customer_No = []
Serial_No = []
data = []
#for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
rows = soup1.find_all("tr")
title = rows[0]
headers = rows[1]
datarows = rows[2:]
fields = []
try :
    for row in datarows :
        find_data = row.find_all(attrs={'face' : 'Arial,Helvetica,sans-serif'})
        count = 0
        for hit in find_data:
            data = hit.text
            count = count + 1
            if count == 3 :
                CSNO = data
            if count == 9 :
                ITNO = data
            else :
                continue
        print CSNO, ITNO
        print "new row"
except:
    pass
Here is the input. The first <tr> is my last row of data, but my loop also iterates over the blank row and the total row below it.
<tr>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">12</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">F5684</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">20182</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">VELOCITY SOLUTIONS INC.</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">EQPRAN77717</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">RANPAK FILLPAK TT 2</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">W/UNIVERSAL STAND S/N 51345563</font></td>
<td nowrap="nowrap" align="right"><font size="3" face="Arial,Helvetica,sans-serif">1</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">51345563</font></td>
</tr>
<tr>
<td nowrap="nowrap" align="left"><font size="1"> </font></td>
<td nowrap="nowrap" align="left"><font size="1"> </font></td>
<td nowrap="nowrap" align="left"><font size="1"> </font></td>
<td nowrap="nowrap" align="left"><font size="1"> </font></td>
<td align="left" colspan="5"><font size="1"> </font></td>
</tr>
<tr>
<td align="left"><font size="3" face="Arial,Helvetica,sans-serif"> </font></td>
<td align="left"><font size="3" face="Arial,Helvetica,sans-serif">Grand Total</font></td>
<td align="left" colspan="7"><font size="1"> </font></td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
Answer 0 (score: 0)
I would do something like this:
from bs4 import BeautifulSoup
content = '''
<root>
<tr>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">12</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">F5684</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">20182</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">VELOCITY SOLUTIONS INC.</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">EQPRAN77717</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">RANPAK FILLPAK TT 2</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">W/UNIVERSAL STAND S/N 51345563</font></td>
<td nowrap="nowrap" align="right"><font size="3" face="Arial,Helvetica,sans-serif">1</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">51345563</font></td>
</tr>
<tr>
<td nowrap="nowrap" align="left"><font size="1"> </font></td>
<td nowrap="nowrap" align="left"><font size="1"> </font></td>
<td nowrap="nowrap" align="left"><font size="1"> </font></td>
<td nowrap="nowrap" align="left"><font size="1"> </font></td>
<td align="left" colspan="5"><font size="1"> </font></td>
</tr>
<tr>
<td align="left"><font size="3" face="Arial,Helvetica,sans-serif"> </font></td>
<td align="left"><font size="3" face="Arial,Helvetica,sans-serif">Grand Total</font></td>
<td align="left" colspan="7"><font size="1"> </font></td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
</root>'''
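# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(content, 'html.parser')
answer = []
rows = soup.find_all('tr')
for row in rows:
    # Skip rows that contain nothing but whitespace
    if not row.text.strip():
        continue
    row_text = []
    for cell in row.find_all('td'):
        # Keep only cells that contain actual text
        if cell.text.strip():
            row_text.append(cell.text)
    answer.append(row_text)
print(answer)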
Output
[[u'12', u'F5684', u'20182', u'VELOCITY SOLUTIONS INC.', u'EQPRAN77717', u'RANPAK FILLPAK TT 2', u'W/UNIVERSAL STAND S/N 51345563', u'1', u'51345563'], [u'Grand Total']]
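You can use if not row.text.strip(): continue to skip rows that are entirely empty (for such a row, row.text.strip() returns an empty string, which evaluates to False).
For the rows that remain, you can use if cell.text.strip() to check whether each cell is empty before saving its text.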
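The output above still contains the ['Grand Total'] row, which the question also wants excluded. A minimal sketch of one way to drop it follows, assuming the cell text of each row has already been collected into a list of lists. The function name filter_data_rows and its skip_labels parameter are hypothetical, not part of BeautifulSoup. Unlike the loop above, this sketch keeps empty cells inside a data row, so column positions such as the 3rd and 9th cell stay stable.

```python
def filter_data_rows(rows, skip_labels=("Grand Total",)):
    """Return only the real data rows from a list of rows of raw cell text.

    rows: a list of lists of strings, one inner list per <tr>.
    skip_labels: hypothetical parameter; a row containing any of these
    labels is treated as a summary row and dropped.
    """
    kept = []
    for cells in rows:
        # Strip whitespace but keep every cell, so positions are preserved.
        texts = [cell.strip() for cell in cells]
        if not any(texts):
            continue  # blank spacer row
        if any(text in skip_labels for text in texts):
            continue  # summary row such as "Grand Total"
        kept.append(texts)
    return kept


# Cell text taken from the sample rows in the question.
sample_rows = [
    ["12", "F5684", "20182", "VELOCITY SOLUTIONS INC.", "EQPRAN77717",
     "RANPAK FILLPAK TT 2", "W/UNIVERSAL STAND S/N 51345563", "1", "51345563"],
    [" ", " ", " ", " ", " "],
    [" ", "Grand Total", " "],
    [" "] * 9,
]

data_rows = filter_data_rows(sample_rows)
# The question reads the 3rd and 9th cells of each row (CSNO and ITNO).
pairs = [(row[2], row[8]) for row in data_rows if len(row) > 8]
print(pairs)  # [('20182', '51345563')]
```

With BeautifulSoup, the input list could be built along the lines of [[td.get_text() for td in tr.find_all('td')] for tr in soup.find_all('tr')]. Matching on the label text rather than on the number of cells avoids relying on row length, which the question says is variable.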