I am trying to parse a web page, and I want to scrape the "tr" elements that carry a bgcolor attribute. Below is the HTML of the page:
<table cellspacing="0" cellpadding="15" id="MainContent_GridView1" style="color:#333333;border-collapse:collapse;">
<tr style="color:White;background-color:#045D99;font-weight:bold;">
<th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$name')" style="color:White;">ORGANIZATION NAME</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$state')" style="color:White;">STATE</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$year')" style="color:White;">YEAR</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$rt')" style="color:White;">FORM</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$pc')" style="color:White;">PAGES</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$ta')" style="color:White;">TOTAL ASSETS</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$ein')" style="color:White;">EIN</a></th>
</tr><tr style="color:#333333;background-color:#ECEEF2;">
<td><a href="//990s.foundationcenter.org/990_pdf_archive/231/231352298/231352298_201702_990.pdf">Zoological Society of Philadelphia Philadelphia Zoo</a></td><td>PA</td><td>2017</td><td>990 </td><td align="right">68</td><td align="right">$124,163,973.00</td><td style="white-space:nowrap;">23-1352298</td>
</tr><tr style="color:#333333;background-color:White;">
<td><a href="//990s.foundationcenter.org/990_pdf_archive/231/231352298/231352298_201602_990.pdf">Zoological Society of Philadelphia</a></td><td>PA</td><td>2016</td><td>990 </td><td align="right">61</td><td align="right">$125,008,026.00</td><td style="white-space:nowrap;">23-1352298</td>
</tr><tr style="color:#333333;background-color:#ECEEF2;">
<td><a href="//990s.foundationcenter.org/990_pdf_archive/231/231352298/231352298_201502_990.pdf">Zoological Society of Philadelphia</a></td><td>PA</td><td>2015</td><td>990 </td><td align="right">63</td><td align="right">$131,880,929.00</td><td style="white-space:nowrap;">23-1352298</td>
</tr>
</table>
I am trying to grab the tr element by its style attribute:
style="color:White;background-color:#045D99;font-weight:bold;"
Below is my code:
import requests
from bs4 import BeautifulSoup
data = requests.get(url).text
soup = BeautifulSoup(data,"lxml")
elems = soup.find_all('tr',style"color:White;background-color:#045D99;font-weight:bold;")
But elems comes back empty. Also, in the soup object, I see that:
style="color:White;background-color:#045D99;font-weight:bold;"
has been changed to:
<tr bgcolor="#ECEEF2">
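I am not sure whether this is what causes the problem. Also, is there a way to scrape the whole table as a pandas DataFrame?
Edit:
There was a typo in my code. The correct code is below: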
soup.find_all('tr',{"style":"color:White;background-color:#045D99;font-weight:bold;"})
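This is the same as what the answers suggest, but I still get an empty result.
One more edit:
Even with the suggestions, I still get no results. The HTML comes from the following web page: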
http://990finder.foundationcenter.org/990results.aspx?990_type=&fn=AMERICAN+HEART+ASSOCIATION&st=&zp=&ei=&fy=&action=Search
I am trying to parse the table present on that page.
Answer 0: (score: 0)
Change the last line of your code to:
soup.find_all('tr',{"style":"color:White;background-color:#045D99;font-weight:bold;"})
You get:
[<tr style="color:White;background-color:#045D99;font-weight:bold;">
<th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$name')" style="color:White;">ORGANIZATION NAME</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$state')" style="color:White;">STATE</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$year')" style="color:White;">YEAR</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$rt')" style="color:White;">FORM</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$pc')" style="color:White;">PAGES</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$ta')" style="color:White;">TOTAL ASSETS</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$ein')" style="color:White;">EIN</a></th>
</tr>]
Regarding your last question: assuming t holds the HTML of the table of interest, you can use pandas.read_html to convert it to a DataFrame:
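import pandas as pd
from io import StringIO
# read_html returns a list of DataFrames, one per table found;
# newer pandas versions expect a file-like object rather than a raw HTML string
df = pd.read_html(StringIO(t))
display(df[0])  # display works in Jupyter/IPython; use print(df[0]) in a plain script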
In your case this gives:
ORGANIZATION NAME STATE YEAR FORM PAGES TOTAL ASSETS EIN
0 Zoological Society of Philadelphia PA 2017 990 68 $124,163,973.00 23-1352298
1 Zoological Society of Philadelphia PA 2016 990 61 $125,008,026.00 23-1352298
2 Zoological Society of Philadelphia PA 2015 990 63 $131,880,929.00 23-1352298
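As a self-contained sketch of the read_html step, the snippet below runs it on a shortened, hypothetical copy of the table (the contents of t are an assumption made for illustration, not the full page). It assumes an HTML parser that pandas can use, such as lxml, is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical, shortened copy of the table from the question.
t = '''<table id="MainContent_GridView1">
<tr><th>ORGANIZATION NAME</th><th>STATE</th><th>YEAR</th></tr>
<tr><td>Zoological Society of Philadelphia</td><td>PA</td><td>2017</td></tr>
<tr><td>Zoological Society of Philadelphia</td><td>PA</td><td>2016</td></tr>
</table>'''

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(t))
df = tables[0]

# The leading row made only of <th> cells is used as the column header.
print(df.shape)
print(list(df.columns))
```

Because read_html always returns a list, indexing with [0] is needed even when the input holds a single table.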
Answer 1: (score: 0)
Your syntax is off. Change it to this:
elems = soup.find_all('tr', {"style":"color:White;background-color:#045D99;font-weight:bold;"})
Full code:
data = '''<table cellspacing="0" cellpadding="15" id="MainContent_GridView1" style="color:#333333;border-collapse:collapse;">
<tr style="color:White;background-color:#045D99;font-weight:bold;">
<th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$name')" style="color:White;">ORGANIZATION NAME</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$state')" style="color:White;">STATE</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$year')" style="color:White;">YEAR</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$rt')" style="color:White;">FORM</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$pc')" style="color:White;">PAGES</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$ta')" style="color:White;">TOTAL ASSETS</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$ein')" style="color:White;">EIN</a></th>
</tr><tr style="color:#333333;background-color:#ECEEF2;">
<td><a href="//990s.foundationcenter.org/990_pdf_archive/231/231352298/231352298_201702_990.pdf">Zoological Society of Philadelphia Philadelphia Zoo</a></td><td>PA</td><td>2017</td><td>990 </td><td align="right">68</td><td align="right">$124,163,973.00</td><td style="white-space:nowrap;">23-1352298</td>
</tr><tr style="color:#333333;background-color:White;">
<td><a href="//990s.foundationcenter.org/990_pdf_archive/231/231352298/231352298_201602_990.pdf">Zoological Society of Philadelphia</a></td><td>PA</td><td>2016</td><td>990 </td><td align="right">61</td><td align="right">$125,008,026.00</td><td style="white-space:nowrap;">23-1352298</td>
</tr><tr style="color:#333333;background-color:#ECEEF2;">
<td><a href="//990s.foundationcenter.org/990_pdf_archive/231/231352298/231352298_201502_990.pdf">Zoological Society of Philadelphia</a></td><td>PA</td><td>2015</td><td>990 </td><td align="right">63</td><td align="right">$131,880,929.00</td><td style="white-space:nowrap;">23-1352298</td>
</tr>
</table>'''
import requests
from bs4 import BeautifulSoup
#data = requests.get(url).text
soup = BeautifulSoup(data,"lxml")
elems = soup.find_all('tr', {"style":"color:White;background-color:#045D99;font-weight:bold;"})
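The question reports that the soup built from the live page shows rows such as <tr bgcolor="#ECEEF2"> rather than the inline style seen in the browser. If that is what the server returns to requests, a search on the style value would find nothing. The sketch below illustrates this with hypothetical markup (the served_html string, including the header row's bgcolor value, is an assumption and not the real response) and uses the built-in html.parser so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup resembling what the question reports seeing in the soup:
# rows carry a bgcolor attribute instead of an inline style.
served_html = '''<table id="MainContent_GridView1">
<tr bgcolor="#045D99"><th>ORGANIZATION NAME</th><th>STATE</th></tr>
<tr bgcolor="#ECEEF2"><td>Zoological Society of Philadelphia</td><td>PA</td></tr>
<tr bgcolor="White"><td>Zoological Society of Philadelphia</td><td>PA</td></tr>
</table>'''

soup = BeautifulSoup(served_html, "html.parser")

# The style-based search finds nothing because no row has a style attribute.
by_style = soup.find_all(
    "tr", {"style": "color:White;background-color:#045D99;font-weight:bold;"}
)

# Passing True matches every row that has a bgcolor attribute, whatever its value.
by_bgcolor = soup.find_all("tr", attrs={"bgcolor": True})

print(len(by_style), len(by_bgcolor))
```

Printing the soup and checking which attributes the rows actually carry is a quick way to decide which of the two searches applies.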
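Answer 2: (score: 0)
I am going to be pedantic here: background-color is not an attribute, it is part of the value of the style attribute. Assuming you want to match a value that contains that substring (perhaps to cater for different colors), we can use the CSS contains operator (*=) on the style attribute value: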
html = '''<table cellspacing="0" cellpadding="15" id="MainContent_GridView1" style="color:#333333;border-collapse:collapse;">
<tr style="color:White;background-color:#045D99;font-weight:bold;">
<th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$name')" style="color:White;">ORGANIZATION NAME</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$state')" style="color:White;">STATE</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$year')" style="color:White;">YEAR</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$rt')" style="color:White;">FORM</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$pc')" style="color:White;">PAGES</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$ta')" style="color:White;">TOTAL ASSETS</a></th><th scope="col"><a href="javascript:__doPostBack('ctl00$MainContent$GridView1','Sort$ein')" style="color:White;">EIN</a></th>
</tr><tr style="color:#333333;background-color:#ECEEF2;">
<td><a href="//990s.foundationcenter.org/990_pdf_archive/231/231352298/231352298_201702_990.pdf">Zoological Society of Philadelphia Philadelphia Zoo</a></td><td>PA</td><td>2017</td><td>990 </td><td align="right">68</td><td align="right">$124,163,973.00</td><td style="white-space:nowrap;">23-1352298</td>
</tr><tr style="color:#333333;background-color:White;">
<td><a href="//990s.foundationcenter.org/990_pdf_archive/231/231352298/231352298_201602_990.pdf">Zoological Society of Philadelphia</a></td><td>PA</td><td>2016</td><td>990 </td><td align="right">61</td><td align="right">$125,008,026.00</td><td style="white-space:nowrap;">23-1352298</td>
</tr><tr style="color:#333333;background-color:#ECEEF2;">
<td><a href="//990s.foundationcenter.org/990_pdf_archive/231/231352298/231352298_201502_990.pdf">Zoological Society of Philadelphia</a></td><td>PA</td><td>2015</td><td>990 </td><td align="right">63</td><td align="right">$131,880,929.00</td><td style="white-space:nowrap;">23-1352298</td>
</tr>
</table>'''
import requests
from bs4 import BeautifulSoup as bs
soup = bs(html,"lxml")
trs = soup.select('tr[style*=";background-color:"]')
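To see what the contains operator selects, here is a minimal sketch on a shortened, hypothetical version of the table (the html string below is an assumption for illustration). It compares a broad substring with a narrower one that includes the header color, and again uses html.parser to avoid depending on lxml.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the table from the question.
html = '''<table>
<tr style="color:White;background-color:#045D99;font-weight:bold;"><th>ORGANIZATION NAME</th><th>STATE</th></tr>
<tr style="color:#333333;background-color:#ECEEF2;"><td>Zoological Society of Philadelphia</td><td>PA</td></tr>
<tr style="color:#333333;background-color:White;"><td>Zoological Society of Philadelphia</td><td>PA</td></tr>
</table>'''

soup = BeautifulSoup(html, "html.parser")

# [style*="..."] matches when the attribute value contains the substring.
all_rows = soup.select('tr[style*="background-color:"]')
header_rows = soup.select('tr[style*="background-color:#045D99"]')

print(len(all_rows), len(header_rows))
print([th.get_text(strip=True) for th in header_rows[0].find_all("th")])
```

The broad substring matches every row with a background color, while adding the color value narrows the match to the header row only.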