Python web scraping with an incrementing id

Date: 2017-06-06 01:59:39

Tags: python beautifulsoup

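I'm trying to scrape a page containing a table that lists each student (tr) and their data (td).

Each student listed in a tr has its own unique id attribute, and the id goes up by 1 for each student.

Example: 1234-1, 1234-2, 1234-3, etc.

I tried to step through the ids by incrementing a count variable by 1. Also, the output only gives the first td rather than all of the td's.

I'm new to Python and to web scraping, and I don't know why this isn't working. Any help would be appreciated.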

import csv
import requests
from bs4 import BeautifulSoup

url = '' # Has been left blank for a reason
response = requests.get(url)
html = response.content

count = 1

print ('-' * 30)

soup = BeautifulSoup(html, "html.parser")
table = soup.find('tr', attrs={'id': '1234-' + str(count)})

list_of_cells = []

while True:
    for cell in table.findAll('td'):
        text = cell.text.replace('\xa0', '')
        list_of_cells.append(text)
    list_of_cells.append(list_of_cells)

    student_name = list_of_cells[0]
    agent_id = list_of_cells[3].replace('-', '')

    total_hrs = list_of_cells[14]
    total_inc = list_of_cells[15]

    count += 1

    print (student_name, "| ", total_hrs, " ", total_inc)
else:
    print('Done')

Example of a tr in the table:

<tr height="17" id="1234-1" style="height:12.75pt;display:none">
  <td class="xl243045" height="17" style="height:12.75pt;border-top:none">
    <a href="48701">Student Name</a>
  </td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
  <td style="border-top:none;border-left:none">stuff</td>
</tr>

2 answers:

Answer 0: (score: 1)

Beautiful Soup lets you select by regular expression, so you could do this:

import re

# If you copy and paste this, be wary of the "-": it doesn't appear to be a standard "-" on a US keyboard. Make it match whatever is in the html.
students = soup.find_all("tr", id=re.compile(r'\d{4}-\d+'))
for student in students:
    cells = student.find_all("td")
    student_name = cells[0].find('a').text
    total_hrs = cells[14].text
    print("{0}|{1}".format(student_name, total_hrs))
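
To see which ids a pattern like the one above would accept, here is a small sketch using only the standard `re` module. As far as I know, Beautiful Soup applies a compiled pattern with `search()`, so an unanchored pattern can match anywhere inside the id. The anchored variant and the sample ids are assumptions made up for illustration.

```python
import re

# The pattern from the answer above: four digits, a hyphen, then more digits.
loose = re.compile(r'\d{4}-\d+')
# Anchored variant (an assumption about what is wanted): exactly "1234-" plus digits.
strict = re.compile(r'^1234-\d+$')

# Made-up ids for illustration; only the first two follow the question's scheme.
ids = ["1234-1", "1234-27", "91234-3", "1234-", "row-1234-5x"]

loose_hits = [i for i in ids if loose.search(i)]
strict_hits = [i for i in ids if strict.search(i)]

print(loose_hits)   # ['1234-1', '1234-27', '91234-3', 'row-1234-5x']
print(strict_hits)  # ['1234-1', '1234-27']
```

If other rows on the page could carry similar-looking ids, the anchored form is probably the safer one to pass to `find_all`.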

But I'm guessing your table is probably just full of students. If so, this might make more sense and be easier to follow:

#access the actual table holding the rows not the row itself -- notice the parent
table = soup.find('tr', attrs={'id': '1234-1'}).parent

# iterate over each of the rows (students)
for row in table.find_all("tr"):
    cells = row.find_all("td")
    student_name = cells[0].find('a').text
    total_hrs = cells[14].text
    print("{0}|{1}".format(student_name, total_hrs))
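
By the way, relying on a student id to find the table is probably not the best idea. Students tend to change. It would likely be better to find something that identifies the table holding the students, rather than depending on one specific student's id.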
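
As a rough sketch of that idea, the snippet below locates the table by an attribute of its own and skips rows that don't have enough cells. The sample HTML, the table id `students`, and the two-column layout are all made up for illustration; the real page will need whatever attribute actually identifies its table.

```python
from bs4 import BeautifulSoup

# Made-up sample page: the table id "students" is an assumption, not from the real site.
html = """
<table id="students">
  <tr><th>Name</th><th>Hours</th></tr>
  <tr id="1234-1"><td><a href="48701">Ann</a></td><td>10</td></tr>
  <tr id="1234-2"><td><a href="48702">Bob</a></td><td>12</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the table itself rather than one particular student's row.
table = soup.find("table", id="students")

results = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) < 2:
        continue  # header row (th cells only) or a row that is too short
    name = cells[0].get_text(strip=True)
    hours = cells[1].get_text(strip=True)
    results.append((name, hours))

for name, hours in results:
    print("{0}|{1}".format(name, hours))
```

Checking the number of cells before indexing also avoids an IndexError on header or spacer rows.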

Answer 1: (score: 0)

table = soup.find('tr', attrs={'id': '1234-' + str(count)}) has to be inside the loop in which you increment count.

count = 1

print('-' * 30)

soup = BeautifulSoup(html, "html.parser")

while True:
    # Look up the row for the current id on every pass through the loop
    table = soup.find('tr', attrs={'id': '1234-' + str(count)})
    if table is None:
        # No row with this id, so there are no more students
        break

    # Start with a fresh list for each student
    list_of_cells = []
    for cell in table.find_all('td'):
        text = cell.text.replace('\xa0', '')
        list_of_cells.append(text)

    student_name = list_of_cells[0]
    agent_id = list_of_cells[3].replace('-', '')

    # Indices 14 and 15 are taken from the question and assume at least 16 cells per row
    total_hrs = list_of_cells[14]
    total_inc = list_of_cells[15]

    count += 1

    print(student_name, "| ", total_hrs, " ", total_inc)

print('Done')
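
A note on how a loop like this ends: `soup.find` returns None once no row has the requested id, and the `else` branch of a `while` loop runs only when the loop condition becomes false, which never happens with `while True`. The sketch below shows that behaviour with a plain dict standing in for the parsed page; the dict and its contents are hypothetical.

```python
# Hypothetical stand-in for the page: maps a row id to that row's cell texts.
rows_by_id = {
    "1234-1": ["Ann", "10"],
    "1234-2": ["Bob", "12"],
}

count = 1
seen = []
while True:
    cells = rows_by_id.get("1234-" + str(count))  # None once the ids run out
    if cells is None:
        break
    seen.append(cells[0])
    count += 1
else:
    # Never reached: the condition of `while True` never becomes false,
    # and leaving the loop through `break` skips the else branch.
    seen.append("else ran")

print(seen)  # ['Ann', 'Bob']
```

So an explicit None check followed by `break` is what stops the loop, and any final message belongs after the loop rather than in an `else`.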