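I am trying to scrape a page of data (a table) that lists each student (tr) and their data (td).
Each student listed in a tr has its own unique ID attribute, and the ID goes up by 1 for each student.
Example: 1234-1, 1234-2, 1234-3, etc.
I attempted to build the id by incrementing a count variable by 1. Also, the output only gives the first td rather than all of the td's.
I am new to Python and also to web scraping, and I am not sure why this isn't working. Any help would be greatly appreciated.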
import csv
import requests
from bs4 import BeautifulSoup

url = '' # Has been left blank for a reason
response = requests.get(url)
html = response.content
count = 1
print ('-' * 30)

soup = BeautifulSoup(html, "html.parser")
table = soup.find('tr', attrs={'id': '1234-' + str(count)})

list_of_cells = []
while True:
    for cell in table.findAll('td'):
        text = cell.text.replace('\xa0', '')
        list_of_cells.append(text)
    list_of_cells.append(list_of_cells)
    student_name = list_of_cells[0]
    agent_id = list_of_cells[3].replace('-', '')
    total_hrs = list_of_cells[14]
    total_inc = list_of_cells[15]
    count += 1
    print (student_name, "| ", total_hrs, " ", total_inc)
else:
    print('Done')
Example of a tr in the table:
<tr height="17" id="1234-1" style="height:12.75pt;display:none">
<td class="xl243045" height="17" style="height:12.75pt;border-top:none">
<a href="48701">Student Name</a>
</td>
<td style="border-top:none;border-left:none">stuff</td>
<td style="border-top:none;border-left:none">stuff</td>
<td style="border-top:none;border-left:none">stuff</td>
<td style="border-top:none;border-left:none">stuff</td>
<td style="border-top:none;border-left:none">stuff</td>
<td style="border-top:none;border-left:none">stuff</td>
<td style="border-top:none;border-left:none">stuff</td>
<td style="border-top:none;border-left:none">stuff</td>
<td style="border-top:none;border-left:none">stuff</td>
<td style="border-top:none;border-left:none">stuff</td>
<td style="border-top:none;border-left:none">stuff</td>
<td style="border-top:none;border-left:none">stuff</td>
<td style="border-top:none;border-left:none">stuff</td>
<td style="border-top:none;border-left:none">stuff</td>
</tr>
Answer 0 (score: 1)
Beautiful Soup lets you select elements by regular expression, so you could do this:
import re

# if you copy and paste this be wary of the "-" it doesn't appear to be a standard "-" on a US keyboard. Make it match whatever is in the html
students = soup.find_all("tr", id=re.compile(r'\d{4}-\d+'))
for student in students:
    cells = student.find_all("td")
    student_name = cells[0].find('a').text
    total_hrs = cells[14].text
    print("{0}|{1}".format(student_name, total_hrs))
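One detail worth keeping in mind with the regex approach: Beautiful Soup filters attribute values with the pattern's search() method, so an unanchored pattern also matches ids that merely contain a student-like substring. A minimal sketch using only the standard re module shows the difference anchoring makes. The id values below are made up for illustration and do not come from the real page.

```python
import re

# Hypothetical id values, invented for illustration only
ids = ["1234-1", "1234-27", "x1234-5-total", "12345-1", "header"]

loose = re.compile(r'\d{4}-\d+')     # pattern used above, unanchored
strict = re.compile(r'^\d{4}-\d+$')  # same pattern, anchored at both ends

# search() succeeds if the pattern matches anywhere in the string
loose_hits = [i for i in ids if loose.search(i)]
strict_hits = [i for i in ids if strict.search(i)]

print(loose_hits)
print(strict_hits)
```

If the page only ever uses ids of the exact form shown in the question, both patterns select the same rows; the anchored one is simply a safer default.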
But I am guessing your table is probably just full of students. If so, this might make more sense and be easier to follow:
# access the actual table holding the rows, not the row itself -- notice the parent
table = soup.find('tr', attrs={'id': '1234-1'}).parent
# iterate over each of the rows (students)
for row in table.find_all("tr"):
    cells = row.find_all("td")
    student_name = cells[0].find('a').text
    total_hrs = cells[14].text
    print("{0}|{1}".format(student_name, total_hrs))
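By the way, relying on a student ID in the table is probably not the best idea.
Students tend to change. It is probably better to find something that identifies the table holding the students, rather than depending on one specific student ID.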
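As a rough sketch of that suggestion, the snippet below assumes the table carries some stable hook of its own. The id="students" attribute and the two-column sample markup are invented for illustration, so substitute whatever the real page offers (and the real column indexes, such as 14). It also skips rows that lack the expected cells, for example a header row, which would otherwise raise an error.

```python
from bs4 import BeautifulSoup

# Invented sample markup; the id "students" is a hypothetical hook
html = """
<table id="students">
  <tr><th>Name</th><th>Hours</th></tr>
  <tr id="1234-1"><td><a href="48701">Ann</a></td><td>10</td></tr>
  <tr id="1234-2"><td><a href="48702">Bob</a></td><td>12\xa0</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the table by its own attribute instead of by a student id
table = soup.find("table", id="students")

results = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    link = cells[0].find("a") if cells else None
    if link is None or len(cells) < 2:
        continue  # header row or a row without the expected cells
    name = link.get_text(strip=True)
    # Drop non-breaking spaces, as the question's code does
    hours = cells[1].get_text().replace("\xa0", "").strip()
    results.append((name, hours))

print(results)
```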
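Answer 1 (score: 0)
The line
table = soup.find('tr', attrs={'id': '1234-' + str(count)})
must be inside the loop in which you increment count. Otherwise the lookup runs only once, with count equal to 1, and every pass of the loop works on the same row.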
count = 1
print('-' * 30)
soup = BeautifulSoup(html, "html.parser")
while True:
    # Look up the next row inside the loop so each pass uses the new count
    table = soup.find('tr', attrs={'id': '1234-' + str(count)})
    if table is None:
        # No row with this id, so every student has been processed
        break
    # Start a fresh list for each row
    list_of_cells = []
    for cell in table.find_all('td'):
        text = cell.text.replace('\xa0', '')
        list_of_cells.append(text)
    student_name = list_of_cells[0]
    agent_id = list_of_cells[3].replace('-', '')
    total_hrs = list_of_cells[14]
    total_inc = list_of_cells[15]
    count += 1
    print(student_name, "| ", total_hrs, " ", total_inc)
print('Done')