Reshaping pandas read_html output into a simpler structure

Date: 2017-01-15 16:06:04

Tags: python pandas python-3.5 bs4

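I am hoping someone can show me how to create a pandas DataFrame that contains only the text of the second column, without the first two rows or the left-hand column. The solution needs to work on multiple similar tables.

I thought pd.read_html(LOTable.prettify(), skiprows=2, flavor='bs4'), which builds a list of DataFrames from the HTML (skipping 2 rows), was the way to go, but the resulting data structure is too hard for this newbie to understand or to manipulate into something simpler.

Does anyone have a way to process the result, or can you recommend another approach to refine the data, so that I end up with a single column containing only the text I need?

Sample table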

<table cellpadding="5" cellspacing="0" class="borders" width="100%">
    <tr>
     <th colspan="2">
      Learning Outcomes
     </th>
    </tr>
    <tr>
     <td class="info" colspan="2">
      On successful completion of this module the learner will be able to:
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO1
     </td>
     <td>
      Demonstrate an awareness of the important role of Financial Accounting information as an input into the decision making process.
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO2
     </td>
     <td>
      Display an understanding of the fundamental accounting concepts, principles and conventions that underpin the preparation of Financial statements.
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO3
     </td>
     <td>
      Understand the various formats in which  information in relation to transactions or events is recorded and classified.
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO4
     </td>
     <td>
      Apply a knowledge of accounting concepts,conventions and techniques such as double entry to the  posting of  recorded information to the T accounts in the Nominal Ledger.
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO5
     </td>
     <td>
      Prepare and present the financial statements of a Sole Trader  in prescribed format from a Trial Balance  accompanies by notes with additional information.
     </td>
    </tr>
   </table> 

1 Answer:

Answer 0: (score: 1)

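Option 1
Use iloc

Because read_html returns a list of DataFrames, take the first element and then use iloc to drop the first column: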

pd.read_html(LOTable.prettify(), skiprows=2, flavor='bs4')[0].iloc[:, 1:]

Explanation

...iloc[:, 1:]
#       ^  ^
#       |   \
# take all   take columns from
# rows       position 1 onward
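
The slicing can be checked on a small hand-built frame before running it on real scraped tables. This is only a sketch: the DataFrame below is an assumed stand-in for one parsed table, and its cell texts are placeholders rather than the real outcomes.

```python
import pandas as pd

# Hand-built stand-in for one parsed table (placeholder values):
# column 0 holds the labels, column 1 holds the text we want to keep.
df = pd.DataFrame([
    ["LO1", "first outcome text"],
    ["LO2", "second outcome text"],
    ["LO3", "third outcome text"],
])

# All rows, columns from position 1 onward: the label column is dropped.
text_only = df.iloc[:, 1:]

print(text_only.shape)          # (3, 1)
print(list(text_only.columns))  # [1]
```

The result is still a DataFrame, just with one column, and it keeps the original column label (1) rather than renumbering from 0.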

To get just that single column, use

pd.read_html(LOTable.prettify(), skiprows=2, flavor='bs4')[0].iloc[:, 1]
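
One detail worth knowing is that an integer position and a one-element list return different types. The sketch below uses an assumed hand-built frame with placeholder text in place of the parsed table.

```python
import pandas as pd

# Placeholder stand-in for one parsed table.
df = pd.DataFrame([
    ["LO1", "first outcome text"],
    ["LO2", "second outcome text"],
])

as_series = df.iloc[:, 1]    # integer position -> pandas Series
as_frame = df.iloc[:, [1]]   # list of positions -> one-column DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_series.tolist())        # ['first outcome text', 'second outcome text']
```

If a plain Python list of strings is all that is needed, calling tolist() on the Series is probably the simplest final step.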

The code I ran

import pandas as pd

htm = """<table cellpadding="5" cellspacing="0" class="borders" width="100%">
    <tr>
     <th colspan="2">
      Learning Outcomes
     </th>
    </tr>
    <tr>
     <td class="info" colspan="2">
      On successful completion of this module the learner will be able to:
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO1
     </td>
     <td>
      Demonstrate an awareness of the important role of Financial Accounting information as an input into the decision making process.
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO2
     </td>
     <td>
      Display an understanding of the fundamental accounting concepts, principles and conventions that underpin the preparation of Financial statements.
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO3
     </td>
     <td>
      Understand the various formats in which  information in relation to transactions or events is recorded and classified.
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO4
     </td>
     <td>
      Apply a knowledge of accounting concepts,conventions and techniques such as double entry to the  posting of  recorded information to the T accounts in the Nominal Ledger.
     </td>
    </tr>
    <tr>
     <td style="width:10%;">
      LO5
     </td>
     <td>
      Prepare and present the financial statements of a Sole Trader  in prescribed format from a Trial Balance  accompanies by notes with additional information.
     </td>
    </tr>
   </table> """

pd.read_html(htm,skiprows=2, flavor='bs4')[0].iloc[:, 1:]
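
Since the question asks for something that works across several similar tables, the same positional selection can be applied to every frame in the list and the pieces stacked into one column. This is a minimal sketch, not part of the original answer: second_columns is a hypothetical helper name, and the hand-built frames stand in for the list that pd.read_html returns.

```python
import pandas as pd


def second_columns(tables):
    """Return the second column of every DataFrame in `tables` as one Series."""
    # iloc[:, 1] selects by position, so differing column labels do not matter.
    columns = [table.iloc[:, 1] for table in tables]
    # ignore_index=True renumbers the rows 0..n-1 across all tables.
    return pd.concat(columns, ignore_index=True)


# Hand-built stand-ins (placeholder values) for the list from pd.read_html.
tables = [
    pd.DataFrame([["LO1", "first outcome"], ["LO2", "second outcome"]]),
    pd.DataFrame([["LO1", "third outcome"]]),
]

outcomes = second_columns(tables)
print(outcomes.tolist())  # ['first outcome', 'second outcome', 'third outcome']
```

Calling outcomes.to_frame() would turn the combined Series back into a single-column DataFrame if that is the shape you need.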
