Question

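I have an HTML file from which I want to extract all tables and h4 elements. That is, I only want to get the tables and h4 elements from the file and use them somewhere else. I am using Notepad++ and am looking for a Python script to accomplish this.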

<html>
// header
<body>
  <div>
  <h4></h4>
  <h4></h4>
  <table>
    // some rows with cells here
    </table>
  // maybe some content here
  <table>
    // a form and other stuff
  </table>
  // probably some more text
 </div>
</body>
</html>

Thanks

Answer 1

I would suggest using the BeautifulSoup module.

You can accomplish what you want as follows:

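    from bs4 import BeautifulSoup

    # Read the whole file; the with-block closes it afterwards
    with open("file.html") as f:
        html = f.read()

    # Name the parser explicitly; html.parser ships with Python
    soup = BeautifulSoup(html, "html.parser")
    htags = soup.find_all('h4')
    tabletags = soup.find_all('table')
    for h in htags:
        print(h.text)
    for table in tabletags:
        print(table.text)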
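
If the goal is to reuse the elements elsewhere rather than just read their text, it may be more useful to keep the markup. The following is a minimal sketch, assuming `bs4` is installed; the helper name `extract_fragments` is made up for illustration. `find_all` accepts a list of tag names and returns matches in document order, and `str(tag)` gives the element's HTML.

```python
from bs4 import BeautifulSoup


def extract_fragments(html):
    """Return the HTML of every <h4> and <table>, in document order.

    Hypothetical helper, not part of BeautifulSoup itself.
    Note: a table nested inside another table is returned twice,
    once inside its parent and once on its own.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [str(tag) for tag in soup.find_all(["h4", "table"])]


# Usage (file name taken from the example above):
#     with open("file.html") as f:
#         for fragment in extract_fragments(f.read()):
#             print(fragment)
```

The resulting strings can be written to a new file or pasted into another document.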

Answer 2

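Since BeautifulSoup has already been mentioned, I just want to point to the tools in the standard library.

You can use the built-in HTML parser or regular expressions (see the tutorial).

Sometimes these tools are enough. It depends on the task.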
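
As a rough sketch of the standard-library route, assuming Python 3's `html.parser` module: the class below (the name `FragmentCollector` is invented for this example) rebuilds the markup of each outermost h4 and table while the parser walks the file. It drops HTML comments and does not try to repair badly nested tags, so treat it as a starting point rather than a full solution.

```python
from html.parser import HTMLParser


class FragmentCollector(HTMLParser):
    """Collect the markup of every outermost <h4> and <table> element."""

    TARGETS = {"h4", "table"}

    def __init__(self):
        # Keep entities such as &amp; unconverted so they can be copied through
        super().__init__(convert_charrefs=False)
        self.fragments = []  # finished fragments, in document order
        self._stack = []     # target tags that are currently open
        self._buffer = []    # pieces of the fragment being captured

    def handle_starttag(self, tag, attrs):
        if tag in self.TARGETS:
            self._stack.append(tag)
        if self._stack:
            # Copy the start tag exactly as written, attributes included
            self._buffer.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no separate end tag
        if self._stack:
            self._buffer.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self._stack:
            return
        self._buffer.append(f"</{tag}>")
        if tag == self._stack[-1]:
            self._stack.pop()
            if not self._stack:  # the outermost target element just closed
                self.fragments.append("".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        if self._stack:
            self._buffer.append(data)

    def handle_entityref(self, name):
        if self._stack:
            self._buffer.append(f"&{name};")

    def handle_charref(self, name):
        if self._stack:
            self._buffer.append(f"&#{name};")
```

To use it, create a `FragmentCollector`, pass the file contents to `feed()`, call `close()`, and read the `fragments` list.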

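BTW: Notepad++ supports regular expressions. <h4.*?/h4> or <table.*?/table> lets you select these blocks. For a block that spans several lines, enable the ". matches newline" option in the search dialog.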
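
The same two patterns can be tried from Python with the `re` module. This is only a sketch (the names `BLOCK_RE` and `extract_blocks` are made up here), and like any regex approach to HTML it breaks on nested tables, because the non-greedy match stops at the first closing tag it finds.

```python
import re

# Non-greedy match from an opening tag to the nearest matching closing tag.
# re.DOTALL lets "." span line breaks; re.IGNORECASE also accepts <TABLE>.
BLOCK_RE = re.compile(
    r"<h4\b.*?</h4>|<table\b.*?</table>",
    re.DOTALL | re.IGNORECASE,
)


def extract_blocks(html):
    """Return every h4 and table block as a string, in document order."""
    return BLOCK_RE.findall(html)
```

For simple, machine-generated pages this is often enough; for anything irregular a real parser is the safer choice.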

Answer 3

The established go-to library for parsing and editing HTML with Python is called BeautifulSoup.

Extract all tables and h4 elements from HTML
