我正在使用BeautifulSoup刮擦一张桌子的网页,但是由于某种原因,它只刮擦了一半桌子。我得到的一半是不包含输入字段的部分。这是html数据:
<table class="commonTable1" cellpadding="0" cellspacing="0" border="0" width="100%" id="portAllocTable">
<tbody>
<tr>
<th class="commonTableHeaderLastCell" colspan="2"><span class="commonBold"> Portfolio Allocation (%) </span></th>
</tr>
<tr>
<td colspan="2" class="commonHeaderContentSeparator"><img src="/fees-web/common/images/spacer.gif" height="1" style="display: block"></td>
</tr>
<tr>
<td>
<span>AdvisorGuided (Capital Portfolio)</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<!-- When collection method is invoice, the portfolio to charge table should be diabled.
Else work as it was-->
<input type="hidden" name="portfolioChargeList[0].feeCollectionRate" value="100" id="selText_1"><input type="text" name="portfolioChargeList[0].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="100" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
<tr>
<td>
<span>AdvisorGuided 2 (Capital Portfolio)</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<!-- When collection method is invoice, the portfolio to charge table should be diabled.
Else work as it was-->
<input type="hidden" name="portfolioChargeList[1].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[1].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
<tr>
<td>
<span>Client Directed (Capital Portfolio)</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<!-- When collection method is invoice, the portfolio to charge table should be diabled.
Else work as it was-->
<input type="hidden" name="portfolioChargeList[2].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[2].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
<tr>
<td>
<span>Holding MMKT (Capital Portfolio)</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<!-- When collection method is invoice, the portfolio to charge table should be diabled.
Else work as it was-->
<input type="hidden" name="portfolioChargeList[3].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[3].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
<tr>
<td>
<span>Total</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<input type="hidden" name="portfolioChargeList[4].feeCollectionRate" value="100" id="selText_1Total"><input type="text" name="portfolioChargeList[4].feeCollectionRateINPUT" maxlength="3" value="100" maxvalue="100" decimals="0" blankifzero="true" id="selText_1TotalINPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
</tbody>
</table>
这是我的代码:
url = driver.page_source
soup = BeautifulSoup(url, "lxml")
table = soup.find('table', id="portAllocTable")
rows = table.findAll('td')
list_of_rows = []
for row in table.findAll('tr'):
list_of_cells = []
for cell in row.findAll(["th","td"]):
text = cell.text
list_of_cells.append(text)
list_of_rows.append(list_of_cells)
for item in list_of_rows:
print(' '.join(item))
我在做什么错?为什么只打印表格的左侧?任何有关更改内容的建议将不胜感激。
Results:
Portfolio Allocation (%)
AdvisorGuided (Capital Portfolio)
100 100
AdvisorGuided 2 (Capital Portfolio)
0 100
Client Directed (Capital Portfolio)
0 100
Holding MMKT (Capital Portfolio)
0 100
Total
100 100
答案 0 :(得分:1)
您将不得不进一步进入子节点和兄弟节点并拉出属性(这些值不是实际的文本/内容。
import pandas as pd
import bs4
html = '''<table class="commonTable1" cellpadding="0" cellspacing="0" border="0" width="100%" id="portAllocTable">
<tbody>
<tr>
<th class="commonTableHeaderLastCell" colspan="2"><span class="commonBold"> Portfolio Allocation (%) </span></th>
</tr>
<tr>
<td colspan="2" class="commonHeaderContentSeparator"><img src="/fees-web/common/images/spacer.gif" height="1" style="display: block"></td>
</tr>
<tr>
<td>
<span>AdvisorGuided (Capital Portfolio)</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<!-- When collection method is invoice, the portfolio to charge table should be diabled.
Else work as it was-->
<input type="hidden" name="portfolioChargeList[0].feeCollectionRate" value="100" id="selText_1"><input type="text" name="portfolioChargeList[0].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="100" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
<tr>
<td>
<span>AdvisorGuided 2 (Capital Portfolio)</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<!-- When collection method is invoice, the portfolio to charge table should be diabled.
Else work as it was-->
<input type="hidden" name="portfolioChargeList[1].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[1].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
<tr>
<td>
<span>Client Directed (Capital Portfolio)</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<!-- When collection method is invoice, the portfolio to charge table should be diabled.
Else work as it was-->
<input type="hidden" name="portfolioChargeList[2].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[2].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
<tr>
<td>
<span>Holding MMKT (Capital Portfolio)</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<!-- When collection method is invoice, the portfolio to charge table should be diabled.
Else work as it was-->
<input type="hidden" name="portfolioChargeList[3].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[3].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
<tr>
<td>
<span>Total</span>
</td>
<td class="commonTableBodyLastCell" align="right">
<span>
<input type="hidden" name="portfolioChargeList[4].feeCollectionRate" value="100" id="selText_1Total"><input type="text" name="portfolioChargeList[4].feeCollectionRateINPUT" maxlength="3" value="100" maxvalue="100" decimals="0" blankifzero="true" id="selText_1TotalINPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
</span>
</td>
</tr>
</tbody>
</table>'''
soup = bs4.BeautifulSoup(html, "lxml")
table = soup.find('table', id="portAllocTable")
rows = table.findAll('td')
list_of_rows = []
for row in table.findAll('tr'):
list_of_cells = []
for cell in row.find_all(["th","td"]):
text = cell.text
try:
val = cell.find('input')['value']
max_val = cell.find('input').next_sibling['maxvalue']
list_of_cells.append(val)
list_of_cells.append(max_val)
except:
pass
list_of_cells.append(text)
list_of_rows.append(list_of_cells)
for item in list_of_rows:
print(' '.join(item))
要创建表格,您可以执行以下操作。您将不得不进行一些清理,但应该可以使您前进:
results = pd.DataFrame()
for row in table.findAll('tr'):
for cell in row.find_all(["th","td"]):
text = cell.text
try:
val = cell.find('input')['value']
max_val = cell.find('input').next_sibling['maxvalue']
except:
val = ''
max_val = ''
pass
temp_df = pd.DataFrame([[text, val, max_val]], columns=['text','value','maxvalue'])
results = results.append(temp_df).reset_index(drop=True)
答案 1 :(得分:0)
我想到了一些事情。
首先:它应该是rows = table.findAll('tr')
,因为tr
HTML标记指定了行。随后,它应该for row in table.findAll('td'):
,因为td
HTML标签是单元格标签。但是您甚至都没有使用rows
变量,所以这很重要。如果您愿意,可以执行以下操作:
soup = BeautifulSoup(url, "lxml")
table = soup.find('table', id="portAllocTable")
rows = table.findAll("tr")
list_of_rows = []
for row in rows:
list_of_cells = []
for cell in row.findAll(['th', 'td']):
text = cell.text
list_of_cells.append(text)
list_of_rows.append(list_of_cells)
for item in list_of_rows:
print(' '.join(item))
第二,此代码不会在输入字段中获取文本,因此这可能就是为什么您只在左侧看到文本的原因。
最后,您可以尝试使用差异分析器,例如html5lib
。