Question

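I am trying to scrape some information from the tables on http://www.house.gov/representatives/. Specifically, I want the information in the "Representative Directory By Last Name" table. So far I can download the HTML from the site and write it to a file, but when I parse it with bs4 and grab the specific table I want, it only gets the first row of each table.

This is because every row of the HTML table has an extra closing tag: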

<tr>
<td><a href="https://abraham.house.gov/">
Abraham, Ralph  </a>
</td>
<td>Louisiana 5th District</td>
<td>R</td>
<td>417 CHOB</td>
<td>202-225-8490</td>
<td>Agriculture<BR>Armed Services<BR>Science, Space, and Technology</td>
</td>
</tr>

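The last </td> tag somehow causes bs4 not to grab the remaining rows. I tested this by manually removing some of the extra tags and got all the rows back, so I know the extra tag is the problem. Here is my Python code so far: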

import bs4, requests

res = requests.get('http://www.house.gov/representatives/')
res.raise_for_status()
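# Write the download to disk; the with block closes (and flushes) the file
with open('HouseReps.html', 'wb') as file:
    for chunk in res.iter_content(100000):
        file.write(chunk)
# Reopen in binary mode so bs4 can detect the encoding itself
with open('HouseReps.html', 'rb') as file:
    soup = bs4.BeautifulSoup(file, 'html.parser')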
table = soup.select('table[title="Representative Directory By Last Name"]')
print(table)

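I also tried prettify(), but that did not help either. Any ideas on how to clean up the HTML so that I can use bs4 (or something else) to parse it and extract the table I need?

Thanks!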

Answer 1

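You can use the lxml parser in your code instead of html.parser (the lxml package must be installed); it tolerates the stray </td> tags on this page: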

import bs4, requests

res = requests.get('http://www.house.gov/representatives/')
res.raise_for_status()
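# Write the download to disk; the with block closes (and flushes) the file
with open('HouseReps.html', 'wb') as file:
    for chunk in res.iter_content(100000):
        file.write(chunk)
# Reopen in binary mode so bs4 can detect the encoding itself
with open('HouseReps.html', 'rb') as file:
    soup = bs4.BeautifulSoup(file, 'lxml')  # use the `lxml` parser instead of `html.parser`
table = soup.find_all("table", {"title": "Representative Directory By Last Name"})
print(table[0])  # print the first matching table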

The output shows the complete first table with title="Representative Directory By Last Name":

<table class="directory" title="Representative Directory By Last Name">
<colgroup>
<col class="name"></col>
<col class="dist2"></col>
<col class="part"></col>
<col class="room"></col>
<col class="phone2"></col>
<col class="comm2"></col>
</colgroup>
<thead>
<tr>
<th>Name</th>
<th>District</th>
<th>Party</th>
<th>Room</th>
<th>Phone</th>
<th>Committee Assignment</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://abraham.house.gov/">
Abraham, Ralph  </a>
</td>
<td>Louisiana 5th District</td>
<td>R</td>
<td>417 CHOB</td>
<td>202-225-8490</td>
<td>Agriculture<br/>Armed Services<br/>Science, Space, and Technology</td>
</tr>
<tr>
<td><a href="http://adams.house.gov">
Adams, Alma </a>
</td>
<td>North Carolina 12th District</td>
<td>D</td>
<td>222 CHOB</td>
<td>202-225-1510</td>
<td>Agriculture<br/>Education and the Workforce<br/>Small Business</td>
</tr>
<tr>
<td><a href="https://aderholt.house.gov/">
Aderholt, Robert </a>
</td>
<td>Alabama 4th District</td>
<td>R</td>
<td>235 CHOB</td>
<td>202-225-4876</td>
<td>Appropriations</td>
</tr>
<tr>
<td><a href="https://aguilar.house.gov/">
Aguilar, Pete </a>
</td>
<td>California 31st District</td>
<td>D</td>
<td>1223 LHOB</td>
<td>202-225-3201</td>
<td>Appropriations</td>
</tr>
<tr>
<td><a href="http://allen.house.gov">
Allen, Rick </a>
</td>
<td>Georgia 12th District</td>
<td>R</td>
<td>426 CHOB</td>
<td>202-225-2823</td>
<td>Agriculture<br/>Education and the Workforce</td>
</tr>
<tr>
<td><a href="https://amash.house.gov/">
Amash, Justin </a>
</td>
<td>Michigan 3rd District</td>
<td>R</td>
<td>114 CHOB</td>
<td>202-225-3831</td>
<td>Oversight and Government</td>
</tr>
<tr>
<td><a href="https://amodei.house.gov">
Amodei, Mark </a>
</td>
<td>Nevada 2nd District</td>
<td>R</td>
<td>332 CHOB</td>
<td>202-225-6155</td>
<td>Appropriations</td>
</tr>
<tr>
<td><a href="https://arrington.house.gov">
Arrington, Jodey  </a>
</td>
<td>Texas 19th District</td>
<td>R</td>
<td>1029 LHOB</td>
<td>202-225-4005</td>
<td>Agriculture<br/>the Budget<br/>Veterans' Affairs</td>
</tr>
</tbody>
</table>
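
Once the table parses correctly, the rows can be turned into plain Python lists. The following is only a minimal sketch, not part of the original answer: `table_rows` and `SAMPLE` are hypothetical names, and `SAMPLE` is two rows copied from the output above. Because that sample is already well-formed, any parser gives the same result for it; `lxml` is kept as the default only to match the answer.

```python
import bs4

# Two rows copied from the output above; this markup is already well-formed.
SAMPLE = """
<table class="directory" title="Representative Directory By Last Name">
<tbody>
<tr>
<td><a href="https://abraham.house.gov/">
Abraham, Ralph  </a>
</td>
<td>Louisiana 5th District</td>
<td>R</td>
<td>417 CHOB</td>
<td>202-225-8490</td>
<td>Agriculture<br/>Armed Services<br/>Science, Space, and Technology</td>
</tr>
<tr>
<td><a href="https://aderholt.house.gov/">
Aderholt, Robert </a>
</td>
<td>Alabama 4th District</td>
<td>R</td>
<td>235 CHOB</td>
<td>202-225-4876</td>
<td>Appropriations</td>
</tr>
</tbody>
</table>
"""


def table_rows(html, parser="lxml"):
    """Return each data row of the directory table as a list of cell strings.

    Hypothetical helper for illustration; `parser` is passed straight
    through to bs4.BeautifulSoup.
    """
    soup = bs4.BeautifulSoup(html, parser)
    table = soup.find("table", {"title": "Representative Directory By Last Name"})
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if not cells:  # header rows use <th>, so they have no <td> cells
            continue
        # A separator keeps the <br/>-separated committee names apart
        rows.append([td.get_text("; ", strip=True) for td in cells])
    return rows
```

Calling `table_rows(SAMPLE)` gives one list per representative, with the name, district, party, room, phone, and committees in that order.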

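If switching parsers is not an option, another possibility is to strip the stray tags before parsing, since the question asks how to clean up the HTML. This is a rough sketch under one assumption: a `</td>` that immediately follows another `</td>` is always an error, because a table cell cannot directly contain another cell. `drop_stray_td` is a hypothetical helper, not a bs4 function.

```python
import re

# One </td> followed by one or more further </td> tags (whitespace allowed between)
_STRAY_TD = re.compile(r"(</td>)(?:\s*</td>)+", re.IGNORECASE)


def drop_stray_td(html):
    """Collapse runs of consecutive </td> closing tags into a single one."""
    return _STRAY_TD.sub(r"\1", html)


# Shortened version of the malformed row from the question
row = """<tr>
<td>R</td>
<td>Agriculture<BR>Armed Services</td>
</td>
</tr>"""

cleaned = drop_stray_td(row)
print(cleaned)
```

The cleaned string can then be passed to `bs4.BeautifulSoup` as usual. A regular expression like this is fragile compared with a lenient parser, so it is best treated as a fallback.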