Is BeautifulSoup only scraping half of my table?

Date: 2019-06-28 12:40:47

Tags: python beautifulsoup

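I am using BeautifulSoup to scrape a table from a web page, but for some reason it only scrapes half of the table. The half I get is the part that does not contain input fields. Here is the HTML: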

<table class="commonTable1" cellpadding="0" cellspacing="0" border="0" width="100%" id="portAllocTable">
    <tbody>
        <tr>
            <th class="commonTableHeaderLastCell" colspan="2"><span class="commonBold"> Portfolio Allocation (%) </span></th>
        </tr>
        <tr>
            <td colspan="2" class="commonHeaderContentSeparator"><img src="/fees-web/common/images/spacer.gif" height="1" style="display: block"></td>
        </tr>
        <tr>
            <td>
                <span>AdvisorGuided (Capital Portfolio)</span>
            </td>
            <td class="commonTableBodyLastCell" align="right">
                <span>
                    <!-- When collection method is invoice,  the portfolio to charge table should be diabled.
                    Else work as it was-->
                    <input type="hidden" name="portfolioChargeList[0].feeCollectionRate" value="100" id="selText_1"><input type="text" name="portfolioChargeList[0].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="100" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
                </span>
            </td>
        </tr>
        <tr>
            <td>
                <span>AdvisorGuided 2 (Capital Portfolio)</span>
            </td>
            <td class="commonTableBodyLastCell" align="right">
                <span>
                    <!-- When collection method is invoice,  the portfolio to charge table should be diabled.
                    Else work as it was-->
                    <input type="hidden" name="portfolioChargeList[1].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[1].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
                </span>
            </td>
        </tr>
        <tr>
            <td>
                <span>Client Directed (Capital Portfolio)</span>
            </td>
            <td class="commonTableBodyLastCell" align="right">
                <span>
                    <!-- When collection method is invoice,  the portfolio to charge table should be diabled.
                    Else work as it was-->
                    <input type="hidden" name="portfolioChargeList[2].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[2].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
                </span>
            </td>
        </tr>
        <tr>
            <td>
                <span>Holding MMKT (Capital Portfolio)</span>
            </td>
            <td class="commonTableBodyLastCell" align="right">
                <span>
                    <!-- When collection method is invoice,  the portfolio to charge table should be diabled.
                    Else work as it was-->
                    <input type="hidden" name="portfolioChargeList[3].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[3].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
                </span>
            </td>
        </tr>
        <tr>
            <td>
                <span>Total</span>
            </td>
            <td class="commonTableBodyLastCell" align="right">
                <span>
                    <input type="hidden" name="portfolioChargeList[4].feeCollectionRate" value="100" id="selText_1Total"><input type="text" name="portfolioChargeList[4].feeCollectionRateINPUT" maxlength="3" value="100" maxvalue="100" decimals="0" blankifzero="true" id="selText_1TotalINPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
                </span>
            </td>
        </tr>
    </tbody>
</table>

Here is my code:


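from bs4 import BeautifulSoup

# driver is assumed to be a Selenium WebDriver that has already loaded the page
url = driver.page_source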

soup = BeautifulSoup(url, "lxml")
table = soup.find('table', id="portAllocTable")
rows = table.findAll('td')

list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll(["th","td"]):
        text = cell.text
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

for item in list_of_rows:
    print(' '.join(item))

What am I doing wrong? Why is only the left side of the table printed? Any suggestions on what to change would be appreciated.

Results:

 Portfolio Allocation (%) 


AdvisorGuided (Capital Portfolio)
 100 100 




AdvisorGuided 2 (Capital Portfolio)
 0 100 




Client Directed (Capital Portfolio)
 0 100 




Holding MMKT (Capital Portfolio)
 0 100 




Total
 100 100

2 Answers:

Answer 0: (score: 1)

You will have to go further down into the child and sibling nodes and pull out the attributes (those values are not actual text/content).

import pandas as pd
import bs4


html = '''<table class="commonTable1" cellpadding="0" cellspacing="0" border="0" width="100%" id="portAllocTable">
    <tbody>
        <tr>
            <th class="commonTableHeaderLastCell" colspan="2"><span class="commonBold"> Portfolio Allocation (%) </span></th>
        </tr>
        <tr>
            <td colspan="2" class="commonHeaderContentSeparator"><img src="/fees-web/common/images/spacer.gif" height="1" style="display: block"></td>
        </tr>
        <tr>
            <td>
                <span>AdvisorGuided (Capital Portfolio)</span>
            </td>
            <td class="commonTableBodyLastCell" align="right">
                <span>
                    <!-- When collection method is invoice,  the portfolio to charge table should be diabled.
                    Else work as it was-->
                    <input type="hidden" name="portfolioChargeList[0].feeCollectionRate" value="100" id="selText_1"><input type="text" name="portfolioChargeList[0].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="100" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
                </span>
            </td>
        </tr>
        <tr>
            <td>
                <span>AdvisorGuided 2 (Capital Portfolio)</span>
            </td>
            <td class="commonTableBodyLastCell" align="right">
                <span>
                    <!-- When collection method is invoice,  the portfolio to charge table should be diabled.
                    Else work as it was-->
                    <input type="hidden" name="portfolioChargeList[1].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[1].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
                </span>
            </td>
        </tr>
        <tr>
            <td>
                <span>Client Directed (Capital Portfolio)</span>
            </td>
            <td class="commonTableBodyLastCell" align="right">
                <span>
                    <!-- When collection method is invoice,  the portfolio to charge table should be diabled.
                    Else work as it was-->
                    <input type="hidden" name="portfolioChargeList[2].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[2].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
                </span>
            </td>
        </tr>
        <tr>
            <td>
                <span>Holding MMKT (Capital Portfolio)</span>
            </td>
            <td class="commonTableBodyLastCell" align="right">
                <span>
                    <!-- When collection method is invoice,  the portfolio to charge table should be diabled.
                    Else work as it was-->
                    <input type="hidden" name="portfolioChargeList[3].feeCollectionRate" value="0" id="selText_1"><input type="text" name="portfolioChargeList[3].feeCollectionRateINPUT" maxlength="3" onkeypress="return disableMinus();" onblur="updateTotal(1);" value="0" maxvalue="100" decimals="0" showalertdialog="true" blankifzero="true" id="selText_1INPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
                </span>
            </td>
        </tr>
        <tr>
            <td>
                <span>Total</span>
            </td>
            <td class="commonTableBodyLastCell" align="right">
                <span>
                    <input type="hidden" name="portfolioChargeList[4].feeCollectionRate" value="100" id="selText_1Total"><input type="text" name="portfolioChargeList[4].feeCollectionRateINPUT" maxlength="3" value="100" maxvalue="100" decimals="0" blankifzero="true" id="selText_1TotalINPUT" style="text-align:right;width:50px" class="commonTextBoxAmount">
                </span>
            </td>
        </tr>
    </tbody>
</table>'''


soup = bs4.BeautifulSoup(html, "lxml")
table = soup.find('table', id="portAllocTable")

list_of_rows = []
for row in table.find_all('tr'):
    list_of_cells = []
    for cell in row.find_all(["th", "td"]):
        text = cell.text
        try:
            # The numbers live in attributes of the <input> tags, not in text nodes
            hidden_input = cell.find('input')
            val = hidden_input['value']
            # The visible text input directly follows the hidden one
            max_val = hidden_input.next_sibling['maxvalue']
            list_of_cells.append(val)
            list_of_cells.append(max_val)
        except (TypeError, KeyError):
            # No <input> in this cell, or the expected attribute is missing
            pass
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

for item in list_of_rows:
    print(' '.join(item))

To build a table, you could do something like the following. You will have to do some cleanup, but it should get you going:

records = []
for row in table.find_all('tr'):
    for cell in row.find_all(["th", "td"]):
        text = cell.text
        try:
            hidden_input = cell.find('input')
            val = hidden_input['value']
            max_val = hidden_input.next_sibling['maxvalue']
        except (TypeError, KeyError):
            # No <input> in this cell, so there is nothing to read
            val = ''
            max_val = ''
        records.append([text, val, max_val])

# Build the frame once at the end; DataFrame.append was removed in pandas 2.0
results = pd.DataFrame(records, columns=['text', 'value', 'maxvalue'])
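
One possible shape for the cleanup mentioned above is to emit a single record per table row instead of one per cell. This is only a sketch: the trimmed HTML, the helper name rows_to_records, and the column name "portfolio" are illustrative assumptions, and it uses the built-in "html.parser" so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the table in the question (assumption).
html = """
<table id="portAllocTable">
  <tr><th colspan="2"><span> Portfolio Allocation (%) </span></th></tr>
  <tr>
    <td><span>AdvisorGuided (Capital Portfolio)</span></td>
    <td><span><input type="hidden" value="100"><input type="text" value="100" maxvalue="100"></span></td>
  </tr>
  <tr>
    <td><span>Client Directed (Capital Portfolio)</span></td>
    <td><span><input type="hidden" value="0"><input type="text" value="0" maxvalue="100"></span></td>
  </tr>
</table>
"""


def rows_to_records(table):
    """Return one dict per row that contains a visible text input (hypothetical helper)."""
    records = []
    for row in table.find_all("tr"):
        text_input = row.find("input", attrs={"type": "text"})
        if text_input is None:
            continue  # header and spacer rows carry no input
        records.append({
            "portfolio": row.find("td").get_text(strip=True),
            # Treat a missing or empty attribute as 0 (assumption)
            "value": int(text_input.get("value") or 0),
            "maxvalue": int(text_input.get("maxvalue") or 0),
        })
    return records


soup = BeautifulSoup(html, "html.parser")
records = rows_to_records(soup.find("table", id="portAllocTable"))
for record in records:
    print(record)
```

A list of dicts like this can be passed straight to pd.DataFrame(records) if a DataFrame is what you want in the end.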

Answer 1: (score: 0)

A few things come to mind.

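First: it should be rows = table.findAll('tr'), because the tr HTML tag designates rows, while the td HTML tag designates cells and belongs in the inner loop. But you never actually use the rows variable, so the point is moot. If you like, you could do the following: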

soup = BeautifulSoup(url, "lxml")
table = soup.find('table', id="portAllocTable")
rows = table.findAll("tr")

list_of_rows = []
for row in rows:
    list_of_cells = []
    for cell in row.findAll(['th', 'td']):
        text = cell.text
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

for item in list_of_rows:
    print(' '.join(item))

Second, this code does not pick up the values in the input fields, which is probably why you only see the text on the left side.
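
To illustrate that point, here is a minimal sketch on a trimmed-down copy of the "Total" row (the shortened HTML is an assumption for brevity). An input element has no text node, so the cell's text is empty; the number is stored in the value attribute.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for one row of the table in the question (assumption).
html = """
<table id="portAllocTable">
  <tr>
    <td><span>Total</span></td>
    <td><span><input type="hidden" value="100" id="selText_1Total"><input type="text" value="100" maxvalue="100" id="selText_1TotalINPUT"></span></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
label_cell, input_cell = soup.find("table", id="portAllocTable").find_all("td")

# The left cell contains a text node; the right cell contains only <input> tags.
label_text = label_cell.get_text(strip=True)
input_text = input_cell.get_text(strip=True)

# The number shown in the browser is stored in the value attribute instead.
text_input = input_cell.find("input", attrs={"type": "text"})
input_value = text_input.get("value")

print(repr(label_text), repr(input_text), repr(input_value))
```

Using .get("value") rather than ["value"] returns None instead of raising KeyError when the attribute is absent.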

Finally, you could try a different parser, such as html5lib.
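
A quick way to check whether the parser matters is to parse the same markup with each parser that is installed and count the input tags it finds. This is only a sketch on a shortened copy of the table (an assumption); lxml and html5lib are optional packages, so parsers that are not installed are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Shortened stand-in for the table in the question (assumption).
html = (
    '<table id="portAllocTable"><tr><td><span>Total</span></td>'
    '<td><span><input type="hidden" value="100">'
    '<input type="text" value="100" maxvalue="100"></span></td></tr></table>'
)

input_counts = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(html, parser)
    except FeatureNotFound:
        # Raised by bs4 when the requested parser library is not installed
        continue
    table = soup.find("table", id="portAllocTable")
    input_counts[parser] = len(table.find_all("input"))

print(input_counts)
```

If every parser reports the same number of inputs, the parser is not the cause, and the fix is to read the attributes rather than the cell text.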