Options for using BeautifulSoup with basic tables that have no class or id

Asked: 2015-11-25 15:02:31

Tags: python beautifulsoup html-table

Is there a recommended way to use BeautifulSoup 4 in Python when your tables have no class or attribute values?

I was thinking of using get_text() to dump the text, but how would I go about selecting individual values or splitting the table into more discrete parts?

<table cellpadding="0" cellspacing="0" id="programmeDescriptor" width="100%">
  <tr>
    <td>
      <table cellpadding="5" cellspacing="0" class="borders" width="100%">
        <tr>
          <th colspan="1">
            Awards
          </th>
        </tr>
        <tr>
        </tr>
        <tr>
          <td>
            Ordinary Bachelor Degree
          </td>
        </tr>
      </table>
      <table border="0" cellpadding="0" cellspacing="0" width="100%">
        <tr>
          <td>
            <table cellpadding="5" cellspacing="0" class="borders">
              <tr>
                <th width="160">
                  Programme Code:
                </th>
                <td width="150">
                  CodeValue
                </td>
              </tr>
            </table>
          </td>
          <td width="5">
          </td>
          <td>
            <table cellpadding="5" cellspacing="0" class="borders">
              <tr>
                <th width="160">
                  Mode of Delivery:
                </th>
                <td width="150">
                  Full Time
                </td>
              </tr>
            </table>
          </td>
          <td width="5">
          </td>
          <td>
            <table cellpadding="5" cellspacing="0" class="borders">
              <tr>
                <th width="160">
                  No. of Semesters:
                </th>
                <td width="150">
                  6
                </td>
              </tr>
            </table>
          </td>
        </tr>
        <tr>
          <td>
            <table cellpadding="5" cellspacing="0" class="borders">
              <tr>
                <th width="160">
                  NFQ Level:
                </th>
                <td width="150">
                  7
                </td>
              </tr>
            </table>
          </td>
        </tr>
        <tr>
          <td>
            <table cellpadding="5" cellspacing="0" class="borders">
              <tr>
                <th width="160">
                  Embedded Award:
                </th>
                <td width="150">
                  No
                </td>
              </tr>
            </table>
          </td>
        </tr>
      </table>
      <table cellpadding="5" cellspacing="0" class="borders" width="100%">
        <tr>
          <th width="160">
            Department:
          </th>
          <td>
            Computing
          </td>
        </tr>
      </table>
      <div class="pageBreak">
      </div>
      <h3>
    Programme Outcomes
   </h3>
      <p class="info">
        On successful completion of this programme the learner will be able to :
      </p>
      <table cellpadding="5" cellspacing="0" class="borders" width="100%">
        <tr>
          <th width="30">
            PO1
          </th>
          <td class="head" colspan="2">
            Knowledge - Breadth
          </td>
        </tr>
        <tr>
          <td class="head" width="30">
          </td>
          <td class="head" width="30">
            (a)
          </td>
          <td>
            • Some block of text
          </td>
        </tr>
        <tr>
          <th width="30">
            PO2
          </th>
          <td class="head" colspan="2">
            Knowledge - Kind
          </td>
        </tr>
        <tr>
          <td class="head" width="30">
          </td>
          <td class="head" width="30">
            (a)
          </td>
          <td>
            • Some block of text
          </td>
        </tr>
        <tr>
          <th width="30">
            PO3
          </th>
          <td class="head" colspan="2">
            Skill - Range
          </td>
        </tr>
        <tr>
          <td class="head" width="30">
          </td>
          <td class="head" width="30">
            (a)
          </td>
          <td>
            • Some block of text
          </td>
        </tr>
        <tr>
          <th width="30">
            PO4
          </th>
          <td class="head" colspan="2">
            Skill - Selectivity
          </td>
        </tr>
        <tr>
          <td class="head" width="30">
          </td>
          <td class="head" width="30">
            (a)
          </td>
          <td>
            • Some block of text
          </td>
        </tr>
        <tr>
          <th width="30">
            PO5
          </th>
          <td class="head" colspan="2">
            Competence - Context
          </td>
        </tr>
        <tr>
          <td class="head" width="30">
          </td>
          <td class="head" width="30">
            (a)
          </td>
          <td>
            Some block of text
          </td>
        </tr>
        <tr>
          <th width="30">
            PO6
          </th>
          <td class="head" colspan="2">
            Competence - Role
          </td>
        </tr>
        <tr>
          <td class="head" width="30">
          </td>
          <td class="head" width="30">
            (a)
          </td>
          <td>
            • Some block of text
          </td>
        </tr>
        <tr>
          <th width="30">
            PO7
          </th>
          <td class="head" colspan="2">
            Competence - Learning to Learn
          </td>
        </tr>
        <tr>
          <td class="head" width="30">
          </td>
          <td class="head" width="30">
            (a)
          </td>
          <td>
            • Some block of text
          </td>
        </tr>
        <tr>
          <th width="30">
            PO8
          </th>
          <td class="head" colspan="2">
            Competence - Insight
          </td>
        </tr>
        <tr>
          <td class="head" width="30">
          </td>
          <td class="head" width="30">
            (a)
          </td>
          <td>
            • The graduate will demonstrate the ability to specify, design and build an IT system or research &amp; report on a current IT topic
          </td>
        </tr>
      </table>
      <div class="pageBreak">
      </div>
      <h3>
    Semester Schedules
   </h3>
      <table cellpadding="0" cellspacing="0" width="100%">
        <tr>
          <td colspan="2">
            <h4>
       Stage 1 / Semester 1
      </h4>
          </td>
        </tr>
        <tr>
          <td colspan="2">
            <table cellpadding="5" cellspacing="0" class="borders" width="100%">
              <tr>
                <td class="head" colspan="2">
                  Mandatory
                </td>
              </tr>
              <tr>
                <th width="50">
                  Module Code
                </th>
                <th>
                  Module Title
                </th>
              </tr>
              <tr>
                <td>
                  Code 
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/3897" target="_blank">
          Web &amp; User Experience
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/3881" target="_blank">
          Software Development 1
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/1645" target="_blank">
          Computer Architecture
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/2328" target="_blank">
          Discrete Mathematics 1
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/3848" target="_blank">
          Business &amp; Information Systems
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/2054" target="_blank">
          Learning to Learn at Third Level
         </a>
                </td>
              </tr>
            </table>
          </td>
        </tr>
      </table>
      <table cellpadding="0" cellspacing="0" width="100%">
        <tr>
          <td colspan="2">
            <h4>
       Stage 1 / Semester 2
      </h4>
          </td>
        </tr>
        <tr>
          <td colspan="2">
            <table cellpadding="5" cellspacing="0" class="borders" width="100%">
              <tr>
                <td class="head" colspan="2">
                  Mandatory
                </td>
              </tr>
              <tr>
                <th width="50">
                  Module Code
                </th>
                <th>
                  Module Title
                </th>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/3886" target="_blank">
          Software Development 2
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/3895" target="_blank">
          Object Oriented Systems Analysis
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/3875" target="_blank">
          Database Fundamentals
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/3874" target="_blank">
          Operating Systems Fundamentals
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/2330" target="_blank">
          Statistics
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/2527" target="_blank">
          Social Media Communications
         </a>
                </td>
              </tr>
            </table>
          </td>
        </tr>
      </table>
      <div class="pageBreak">
      </div>
      <table cellpadding="0" cellspacing="0" width="100%">
        <tr>
          <td colspan="2">
            <h4>
       Stage 2 / Semester 1
      </h4>
          </td>
        </tr>
        <tr>
          <td colspan="2">
            <table cellpadding="5" cellspacing="0" class="borders" width="100%">
              <tr>
                <td class="head" colspan="2">
                  Mandatory
                </td>
              </tr>
              <tr>
                <th width="50">
                  Module Code
                </th>
                <th>
                  Module Title
                </th>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/3877" target="_blank">
          Web &amp; Mobile Design &amp; Development
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/3876" target="_blank">
          Database Design And Programming
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/3869" target="_blank">
          Software Development 3
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/3873" target="_blank">
          Software Quality Assurance and Testing
         </a>
                </td>
              </tr>
              <tr>
                <td>
                 Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/3629" target="_blank">
          Networking 1
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/2477" target="_blank">
          Discrete Mathematics 2
         </a>
                </td>
              </tr>
            </table>
          </td>
        </tr>
      </table>
      <table cellpadding="0" cellspacing="0" width="100%">
        <tr>
          <td colspan="2">
            <h4>
       Stage 2 / Semester 2
      </h4>
          </td>
        </tr>
        <tr>
          <td colspan="2">
            <table cellpadding="5" cellspacing="0" class="borders" width="100%">
              <tr>
                <td class="head" colspan="2">
                  Mandatory
                </td>
              </tr>
              <tr>
                <th width="50">
                  Module Code
                </th>
                <th>
                  Module Title
                </th>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/3862" target="_blank">
          Project
         </a>
                </td>
              </tr>
              <tr>
                <td>
                 Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/3911" target="_blank">
          Object Oriented Analysis &amp; Design 1
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/3877" target="_blank">
          Web &amp; Mobile Design &amp; Development
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/3630" target="_blank">
          Networking 2
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/3870" target="_blank">
          Software Development 4
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/2476" target="_blank">
          Management Science
         </a>
                </td>
              </tr>
            </table>
          </td>
        </tr>
      </table>
      <div class="pageBreak">
      </div>
      <table cellpadding="0" cellspacing="0" width="100%">
        <tr>
          <td colspan="2">
            <h4>
       Stage 3 / Semester 1
      </h4>
          </td>
        </tr>
        <tr>
          <td colspan="2">
            <table cellpadding="5" cellspacing="0" class="borders" width="100%">
              <tr>
                <td class="head" colspan="2">
                  Mandatory
                </td>
              </tr>
              <tr>
                <th width="50">
                  Module Code
                </th>
                <th>
                  Module Title
                </th>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/3911" target="_blank">
          Object Oriented Analysis &amp; Design 1
         </a>
                </td>
              </tr>
              <tr>
                <td>
                 Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/3899" target="_blank">
          Operating Systems
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/1721" target="_blank">
          Cloud Services &amp; Distributed Computing
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/2580" target="_blank">
          Innovation &amp; Entrepreneurship
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/3878" target="_blank">
          Web Application Development
         </a>
                </td>
              </tr>
              <tr>
                <td>
                 Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/1689" target="_blank">
          Algorithms and Data Structures 1
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/2025" target="_blank">
          Logic and Problem Solving
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/3896" target="_blank">
          Advanced Databases
         </a>
                </td>
              </tr>
            </table>
          </td>
        </tr>
      </table>
      <table cellpadding="0" cellspacing="0" width="100%">
        <tr>
          <td colspan="2">
            <h4>
       Stage 3 / Semester 2
      </h4>
          </td>
        </tr>
        <tr>
          <td colspan="2">
            <table cellpadding="5" cellspacing="0" class="borders" width="100%">
              <tr>
                <td class="head" colspan="2">
                  Mandatory
                </td>
              </tr>
              <tr>
                <th width="50">
                  Module Code
                </th>
                <th>
                  Module Title
                </th>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/2465" target="_blank">
          Project
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/1728" target="_blank">
          Algorithms and Data Structures 2
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/1675" target="_blank">
          Network Management
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/2025" target="_blank">
          Logic and Problem Solving
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/3899" target="_blank">
          Operating Systems
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/2580" target="_blank">
          Innovation &amp; Entrepreneurship
         </a>
                </td>
              </tr>
              <tr>
                <td>
                  Code
                </td>
                <td>
                  <a href="index.cfm/page/module/moduleId/1679" target="_blank">
          Object Oriented Analysis &amp; Design 2
         </a>
                </td>
              </tr>
            </table>
          </td>
        </tr>
      </table>
      </td>
  </tr>
</table>

2 answers:

Answer 0 (score: 1)

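You can iterate over specific tags. I'm not sure exactly what you want to do, but if you want the text of every <th> tag, just iterate over them and call get_text() on each one.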
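
A minimal sketch of that idea, assuming the page is parsed with the built-in `html.parser`. The `html` string is a shortened stand-in for the markup in the question, and `labels` and `pairs` are illustrative names, not anything from the original post.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (assumption: same nesting style).
html = """
<table id="programmeDescriptor">
  <tr><td>
    <table class="borders">
      <tr><th width="160"> Programme Code: </th><td width="150"> CodeValue </td></tr>
    </table>
    <table class="borders">
      <tr><th width="160"> NFQ Level: </th><td width="150"> 7 </td></tr>
    </table>
  </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the text of every <th>, trimming the surrounding whitespace.
labels = [th.get_text(strip=True) for th in soup.find_all("th")]

# Pair each label with the <td> that follows it in the same row, if any.
pairs = {}
for th in soup.find_all("th"):
    value_cell = th.find_next_sibling("td")
    if value_cell is not None:
        pairs[th.get_text(strip=True)] = value_cell.get_text(strip=True)

print(labels)
print(pairs)
```

Passing `strip=True` to `get_text()` removes the newlines and indentation that surround each cell's text, which makes the values easier to compare or store. Header cells with no `<td>` beside them (such as column headings) are simply skipped when building `pairs`.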

Answer 1 (score: 1)

First, the table that contains all the others has an id attribute, so make it the starting point of the search:

super_table = soup.find("table", id="programmeDescriptor")

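Then, based on what you mentioned in the comments, it looks like you can distinguish each inner table by its header. One option for implementing this logic is to locate the header and then use find_parent() to get its parent table: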

def get_table_by_header_name(super_table, header):
    return super_table.find("th", text=header).find_parent("table")

Usage:

desired_table = get_table_by_header_name(super_table, "Awards")
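
If the header text is surrounded by whitespace, as it is in the pretty-printed markup in the question, an exact text match may find nothing, and `find()` then returns `None`. The sketch below is one possible whitespace-tolerant variant that also reads the rows of the table it finds. `find_table_by_header` and `table_rows` are hypothetical helper names, and the `html` sample is a shortened stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (assumption: same nesting style).
html = """
<table id="programmeDescriptor">
  <tr><td>
    <table class="borders">
      <tr>
        <th colspan="1">
          Awards
        </th>
      </tr>
      <tr>
        <td>
          Ordinary Bachelor Degree
        </td>
      </tr>
    </table>
    <table class="borders">
      <tr>
        <th>Module Code</th>
        <th>Module Title</th>
      </tr>
      <tr>
        <td>Code</td>
        <td><a href="index.cfm/page/module/moduleId/3881">Software Development 1</a></td>
      </tr>
    </table>
  </td></tr>
</table>
"""


def find_table_by_header(super_table, header):
    """Return the nearest table around a <th> whose trimmed text equals header."""
    for th in super_table.find_all("th"):
        if th.get_text(strip=True) == header:
            return th.find_parent("table")
    return None  # no matching header found


def table_rows(table):
    """Return each non-empty row as a list of trimmed cell texts."""
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


soup = BeautifulSoup(html, "html.parser")
super_table = soup.find("table", id="programmeDescriptor")

awards = find_table_by_header(super_table, "Awards")
modules = find_table_by_header(super_table, "Module Code")

print(table_rows(awards))
print(table_rows(modules))
```

Note that `table_rows` searches recursively, so it is meant for the innermost tables; called on a table that wraps other tables, it would also return the nested rows.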