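I'm a Python beginner, working through some tasks with data I'm familiar with in order to learn the basics. I'm trying to walk through a table to collect contact information, but I'm having trouble getting the data out of the list of tds.
The HTML looks like this: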
<table class="table table-striped" data-drupal-selector="edit-directory" id="edit-directory--zJwP9mT4moQ">
<thead>
<tr>
<th>Name</th>
<th>Job Title</th>
<th>Campus/Department</th>
<th>Contact</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>LAST, FIRST</td>
<td>T-HS SCI- GEN'L</td>
<td><span tabindex="0">SCHOOL</span></td>
<td><a href="mailto:teacher@school.org" class="email"><span aria-hidden="true">Email</span><span class="sr-only">teacher@school.org</span></a><br>555-555-5555</td>
</tr>
</tbody>
</table>
I have this code to get the table:
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup as bs

data = urllib.parse.urlencode(params).encode("utf-8")
req = urllib.request.Request(url)
with urllib.request.urlopen(req, data=data) as f:
    soup = bs(f, 'html.parser')
table = soup.find("table")
for row in table.findAll("tr"):
    #print (row)
    cells = row.findAll("td")
    print(cells)
I get something like this:
[<td>LAST,FIRST </td>, <td>TEMP PROF</td>, <td><span tabindex="0">SCHOOL</span></td>, <td><a class="email" href="mailto:teacher@school.org"><span aria-hidden="true">Email</span><span class="sr-only">teacher@school.org</span></a><br/>555-555-5555</td>]
[<td><a href="https://teachersite.com" target="_blank">LAST, FIRST</a></td>, <td>T-ENGLISH</td>, <td><span tabindex="0">SCHOOL</span></td>, <td><a class="email" href="mailto:teacher@school.org"><span aria-hidden="true">Email</span><span class="sr-only">teacher@school.org</span></a><br/>555-555-5555</td>]
However, if I try to get the data out of the list:
print(cells[1])
it says the index is out of range.
What I want to end up with is something like this:
last = 'LAST'
first = 'FIRST'
email = 'teacher@school.org'
title = 'TEMP PROF'
phone = '555-555-5555'
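Answer 0: (score: 1)
It looks like you want to strip the text out of each element:
for row in table.findAll('tr'):
    cols = row.findAll('td')
    cols = [element.text.strip() for element in cols]
    for col in cols:
        print(col)
To get the first and last names, you can use .split(', ') to split the first element on the comma and space. Hope this points you in the right direction!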
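A note on the "index out of range" error from the question: the first tr in the table is the header row, which holds only th cells, so findAll("td") returns an empty list for it and cells[1] has nothing to index. Also, one of the printed rows in the question reads 'LAST,FIRST ' with no space after the comma, so splitting on ', ' would not separate it. Below is a minimal sketch of one way to handle both cases. It assumes a trimmed-down copy of the question's table as sample data and that every name contains a comma; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the table from the question (assumed sample data).
html = """
<table>
<thead><tr><th>Name</th><th>Job Title</th></tr></thead>
<tbody>
<tr><td>LAST, FIRST</td><td>T-HS SCI- GEN'L</td></tr>
<tr><td>LAST,FIRST </td><td>TEMP PROF</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
people = []
for row in soup.find("table").find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if not cells:
        # Header row: it holds <th> cells only, so there is nothing to index.
        continue
    # Split on the comma alone and strip, so "LAST,FIRST " also works.
    last, first = (part.strip() for part in cells[0].split(",", 1))
    people.append({"last": last, "first": first, "title": cells[1]})

print(people)
```

Skipping empty rows keeps the loop independent of where the header sits, which seems safer than slicing off the first row when a page has more than one header row.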
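Answer 1: (score: 0)
You can iterate over each tr and pull the data you need from its td elements:
from bs4 import BeautifulSoup as soup

def scrape_td(d):
    # Unpack the four cells: name, job title, campus (unused), contact
    n, t, _, c = d.find_all('td')
    return {**dict(zip(['last', 'first'], n.text.split(', '))), 'title':t.text, 'email':c.contents[0]['href'][7:], 'phone':c.contents[-1]}

# html is assumed to hold the page source as a string; [1:] skips the header row
results = list(map(scrape_td, soup(html, 'html.parser').find('table', {'id':'edit-directory--zJwP9mT4moQ'}).find_all('tr')[1:]))
Output: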
[{'last': 'LAST', 'first': 'FIRST', 'title': "T-HS SCI- GEN'L", 'email': 'teacher@school.org', 'phone': '555-555-5555'}]
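The [7:] slice in this answer works because every href in the sample starts with exactly "mailto:". If that might not hold for every row, a slightly more defensive sketch could parse the link with the standard library instead. Here email_from_href is a hypothetical helper name chosen for illustration, not part of BeautifulSoup.

```python
from urllib.parse import urlparse


def email_from_href(href):
    """Return the address from a mailto: link, or None for any other link."""
    parts = urlparse(href)
    if parts.scheme != "mailto":
        # Not an email link (for example a teacher's website), so no address.
        return None
    return parts.path


print(email_from_href("mailto:teacher@school.org"))
print(email_from_href("https://teachersite.com"))
```

In scrape_td this could stand in for c.contents[0]['href'][7:], for example as email_from_href(c.contents[0]['href']), assuming the first child of the contact cell is still the a tag.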