Scraping table elements by their background-color style with BeautifulSoup

Asked: 2019-06-05 10:21:59

Tags: python python-3.x pandas html-table beautifulsoup

I am trying to parse a web page, and I also want to scrape the "tr" elements on it that carry a background colour. Below is the HTML of that page:

<table cellspacing="0" cellpadding="15" id="MainContent_GridView1" style="color:#333333;border-collapse:collapse;">
    <tr style="color:White;background-color:#045D99;font-weight:bold;">
        <th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$MainContent$GridView1&#39;,&#39;Sort$name&#39;)" style="color:White;">ORGANIZATION NAME</a></th><th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$MainContent$GridView1&#39;,&#39;Sort$state&#39;)" style="color:White;">STATE</a></th><th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$MainContent$GridView1&#39;,&#39;Sort$year&#39;)" style="color:White;">YEAR</a></th><th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$MainContent$GridView1&#39;,&#39;Sort$rt&#39;)" style="color:White;">FORM</a></th><th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$MainContent$GridView1&#39;,&#39;Sort$pc&#39;)" style="color:White;">PAGES</a></th><th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$MainContent$GridView1&#39;,&#39;Sort$ta&#39;)" style="color:White;">TOTAL ASSETS</a></th><th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$MainContent$GridView1&#39;,&#39;Sort$ein&#39;)" style="color:White;">EIN</a></th>
    </tr><tr style="color:#333333;background-color:#ECEEF2;">
        <td><a href="//990s.foundationcenter.org/990_pdf_archive/231/231352298/231352298_201702_990.pdf">Zoological Society of Philadelphia Philadelphia Zoo</a></td><td>PA</td><td>2017</td><td>990   </td><td align="right">68</td><td align="right">$124,163,973.00</td><td style="white-space:nowrap;">23-1352298</td>
    </tr><tr style="color:#333333;background-color:White;">
        <td><a href="//990s.foundationcenter.org/990_pdf_archive/231/231352298/231352298_201602_990.pdf">Zoological Society of Philadelphia</a></td><td>PA</td><td>2016</td><td>990   </td><td align="right">61</td><td align="right">$125,008,026.00</td><td style="white-space:nowrap;">23-1352298</td>
    </tr><tr style="color:#333333;background-color:#ECEEF2;">
        <td><a href="//990s.foundationcenter.org/990_pdf_archive/231/231352298/231352298_201502_990.pdf">Zoological Society of Philadelphia</a></td><td>PA</td><td>2015</td><td>990   </td><td align="right">63</td><td align="right">$131,880,929.00</td><td style="white-space:nowrap;">23-1352298</td>
    </tr>
</table>

I am trying to grab the tr element by its style attribute:

style="color:White;background-color:#045D99;font-weight:bold;"

Below is my code:

import requests
from bs4 import BeautifulSoup
data = requests.get(url).text
soup = BeautifulSoup(data,"lxml")
elems = soup.find_all('tr',style"color:White;background-color:#045D99;font-weight:bold;")

But elems comes back empty. Also, in the soup object, I see that:

style="color:White;background-color:#045D99;font-weight:bold;"

has been changed to

<tr bgcolor="#ECEEF2">

I am not sure whether this is what causes the problem. Also, is there a way to scrape the whole table as a pandas DataFrame?

Edit:

There was a typo in my code. Below is the correct code:

soup.find_all('tr',{"style":"color:White;background-color:#045D99;font-weight:bold;"})

This is the same as what the answers suggest, but I still get an empty result.

One more edit:

Even with the suggestions, I still get no results. The HTML comes from the following web page:

http://990finder.foundationcenter.org/990results.aspx?990_type=&fn=AMERICAN+HEART+ASSOCIATION&st=&zp=&ei=&fy=&action=Search

I am trying to parse the table present on that page.

3 Answers:

Answer 0: (score: 0)

Change the last line of your code to:

soup.find_all('tr',{"style":"color:White;background-color:#045D99;font-weight:bold;"})

You get:

[<tr style="color:White;background-color:#045D99;font-weight:bold;">
 <th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$name')" style="color:White;">ORGANIZATION NAME</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$state')" style="color:White;">STATE</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$year')" style="color:White;">YEAR</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$rt')" style="color:White;">FORM</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$pc')" style="color:White;">PAGES</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$ta')" style="color:White;">TOTAL ASSETS</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$ein')" style="color:White;">EIN</a></th>
 </tr>]

As for the last question: assuming t holds the HTML of the table of interest, you can use pandas.read_html to convert it into a DataFrame:

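import pandas as pd
from io import StringIO

# read_html returns a list of DataFrames, one per <table> it finds;
# wrapping the string in StringIO avoids the deprecation warning in newer pandas
df = pd.read_html(StringIO(t))

display(df[0])  # display() exists in Jupyter/IPython; use print(df[0]) elsewhere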

In your case:

                    ORGANIZATION NAME STATE  YEAR  FORM  PAGES     TOTAL ASSETS         EIN
0  Zoological Society of Philadelphia    PA  2017   990     68  $124,163,973.00  23-1352298
1  Zoological Society of Philadelphia    PA  2016   990     61  $125,008,026.00  23-1352298
2  Zoological Society of Philadelphia    PA  2015   990     63  $131,880,929.00  23-1352298
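When a page holds more than one table, it may help to tell read_html which one to return. The following is only a minimal sketch, not part of the original answer: the HTML string is a shortened stand-in for the table in the question, and the clean-up of the TOTAL ASSETS column is an optional extra step added for illustration. It assumes lxml (or bs4 with html5lib) is installed, which read_html needs.

```python
from io import StringIO

import pandas as pd

# Shortened stand-in for the table in the question (assumption: two data rows only).
t = """
<table id="MainContent_GridView1">
  <tr><th>ORGANIZATION NAME</th><th>STATE</th><th>YEAR</th><th>TOTAL ASSETS</th></tr>
  <tr><td>Zoological Society of Philadelphia</td><td>PA</td><td>2016</td><td>$125,008,026.00</td></tr>
  <tr><td>Zoological Society of Philadelphia</td><td>PA</td><td>2015</td><td>$131,880,929.00</td></tr>
</table>
"""

# read_html returns a list of DataFrames; attrs keeps only tables with this id.
tables = pd.read_html(StringIO(t), attrs={"id": "MainContent_GridView1"})
df = tables[0]

# Optional clean-up: strip "$" and "," so the amounts become floats.
df["TOTAL ASSETS"] = (
    df["TOTAL ASSETS"].str.replace(r"[$,]", "", regex=True).astype(float)
)

print(df)
```

The leading row of th cells is picked up as the column header, so no header argument is needed here.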

Answer 1: (score: 0)

Your syntax is off. Change it to this:

elems = soup.find_all('tr', {"style":"color:White;background-color:#045D99;font-weight:bold;"})

Full code:

data = '''<table cellspacing="0" cellpadding="15" id="MainContent_GridView1" style="color:#333333;border-collapse:collapse;">
    <tr style="color:White;background-color:#045D99;font-weight:bold;">
        <th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$MainContent$GridView1&#39;,&#39;Sort$name&#39;)" style="color:White;">ORGANIZATION NAME</a></th><th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$MainContent$GridView1&#39;,&#39;Sort$state&#39;)" style="color:White;">STATE</a></th><th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$MainContent$GridView1&#39;,&#39;Sort$year&#39;)" style="color:White;">YEAR</a></th><th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$MainContent$GridView1&#39;,&#39;Sort$rt&#39;)" style="color:White;">FORM</a></th><th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$MainContent$GridView1&#39;,&#39;Sort$pc&#39;)" style="color:White;">PAGES</a></th><th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$MainContent$GridView1&#39;,&#39;Sort$ta&#39;)" style="color:White;">TOTAL ASSETS</a></th><th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$MainContent$GridView1&#39;,&#39;Sort$ein&#39;)" style="color:White;">EIN</a></th>
    </tr><tr style="color:#333333;background-color:#ECEEF2;">
        <td><a href="//990s.foundationcenter.org/990_pdf_archive/231/231352298/231352298_201702_990.pdf">Zoological Society of Philadelphia Philadelphia Zoo</a></td><td>PA</td><td>2017</td><td>990   </td><td align="right">68</td><td align="right">$124,163,973.00</td><td style="white-space:nowrap;">23-1352298</td>
    </tr><tr style="color:#333333;background-color:White;">
        <td><a href="//990s.foundationcenter.org/990_pdf_archive/231/231352298/231352298_201602_990.pdf">Zoological Society of Philadelphia</a></td><td>PA</td><td>2016</td><td>990   </td><td align="right">61</td><td align="right">$125,008,026.00</td><td style="white-space:nowrap;">23-1352298</td>
    </tr><tr style="color:#333333;background-color:#ECEEF2;">
        <td><a href="//990s.foundationcenter.org/990_pdf_archive/231/231352298/231352298_201502_990.pdf">Zoological Society of Philadelphia</a></td><td>PA</td><td>2015</td><td>990   </td><td align="right">63</td><td align="right">$131,880,929.00</td><td style="white-space:nowrap;">23-1352298</td>
    </tr>
</table>'''


import requests
from bs4 import BeautifulSoup
#data = requests.get(url).text
soup = BeautifulSoup(data,"lxml")
elems = soup.find_all('tr', {"style":"color:White;background-color:#045D99;font-weight:bold;"})
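Matching the whole style string only works while the attribute is byte-for-byte identical, so any change in spacing or property order makes it miss. As a hedged alternative (not from the original answer), find_all also accepts a compiled regular expression as an attribute filter. The sketch below uses a shortened, made-up version of the table and the built-in "html.parser" so that it runs without lxml.

```python
import re

from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question (assumption for illustration).
html = """
<table>
  <tr style="color:White;background-color:#045D99;font-weight:bold;"><th>STATE</th></tr>
  <tr style="color:#333333;background-color:#ECEEF2;"><td>PA</td></tr>
  <tr style="color:#333333; background-color: White;"><td>PA</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact match: the full attribute value has to be identical.
exact = soup.find_all(
    "tr", {"style": "color:White;background-color:#045D99;font-weight:bold;"}
)

# Regex match: only the background-color declaration has to be present,
# with optional whitespace after the colon.
header_pattern = re.compile(r"background-color:\s*#045D99", re.IGNORECASE)
by_regex = soup.find_all("tr", style=header_pattern)

print(len(exact), len(by_regex))
```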

Answer 2: (score: 0)

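I'll go out on a limb here. background-color is not an attribute; it is part of the value of the style attribute. Assuming you want rows whose style value contains that substring (perhaps to cater for different colours), we can use the CSS contains operator (*=) to match against the style attribute value: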

html = '''<table cellspacing="0" cellpadding="15" id="MainContent_GridView1" style="color:#333333;border-collapse:collapse;">
    <tr style="color:White;background-color:#045D99;font-weight:bold;">
        <th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$MainContent$GridView1&#39;,&#39;Sort$name&#39;)" style="color:White;">ORGANIZATION NAME</a></th><th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$MainContent$GridView1&#39;,&#39;Sort$state&#39;)" style="color:White;">STATE</a></th><th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$MainContent$GridView1&#39;,&#39;Sort$year&#39;)" style="color:White;">YEAR</a></th><th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$MainContent$GridView1&#39;,&#39;Sort$rt&#39;)" style="color:White;">FORM</a></th><th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$MainContent$GridView1&#39;,&#39;Sort$pc&#39;)" style="color:White;">PAGES</a></th><th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$MainContent$GridView1&#39;,&#39;Sort$ta&#39;)" style="color:White;">TOTAL ASSETS</a></th><th scope="col"><a href="javascript:__doPostBack(&#39;ctl00$MainContent$GridView1&#39;,&#39;Sort$ein&#39;)" style="color:White;">EIN</a></th>
    </tr><tr style="color:#333333;background-color:#ECEEF2;">
        <td><a href="//990s.foundationcenter.org/990_pdf_archive/231/231352298/231352298_201702_990.pdf">Zoological Society of Philadelphia Philadelphia Zoo</a></td><td>PA</td><td>2017</td><td>990   </td><td align="right">68</td><td align="right">$124,163,973.00</td><td style="white-space:nowrap;">23-1352298</td>
    </tr><tr style="color:#333333;background-color:White;">
        <td><a href="//990s.foundationcenter.org/990_pdf_archive/231/231352298/231352298_201602_990.pdf">Zoological Society of Philadelphia</a></td><td>PA</td><td>2016</td><td>990   </td><td align="right">61</td><td align="right">$125,008,026.00</td><td style="white-space:nowrap;">23-1352298</td>
    </tr><tr style="color:#333333;background-color:#ECEEF2;">
        <td><a href="//990s.foundationcenter.org/990_pdf_archive/231/231352298/231352298_201502_990.pdf">Zoological Society of Philadelphia</a></td><td>PA</td><td>2015</td><td>990   </td><td align="right">63</td><td align="right">$131,880,929.00</td><td style="white-space:nowrap;">23-1352298</td>
    </tr>
</table>'''


import requests
from bs4 import BeautifulSoup as bs
soup = bs(html,"lxml")
trs = soup.select('tr[style*=";background-color:"]')
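The question mentions that the fetched page shows rows as `<tr bgcolor="#ECEEF2">` rather than with a style attribute. If that is what the server really returns, a selector list covering both forms might help. This is a rough sketch under that assumption, using a made-up four-row table and the built-in "html.parser"; it has not been checked against the live page.

```python
from bs4 import BeautifulSoup

# Made-up rows covering both markup variants (assumption for illustration).
html = """
<table>
  <tr style="color:White;background-color:#045D99;font-weight:bold;"><th>STATE</th></tr>
  <tr style="color:#333333;background-color:#ECEEF2;"><td>PA</td></tr>
  <tr bgcolor="#ECEEF2"><td>NJ</td></tr>
  <tr><td>NY</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Rows coloured through the style attribute OR through a bgcolor attribute.
rows = soup.select('tr[style*="background-color:"], tr[bgcolor]')
texts = [row.get_text(strip=True) for row in rows]

# Narrowing the substring picks out one specific colour, here the header row.
header_rows = soup.select('tr[style*="background-color:#045D99"]')

print(texts, len(header_rows))
```

Printing `soup.find("tr").attrs` on the real response is a quick way to see which of the two attributes the rows actually carry.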