Question

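I want to extract data from several HTML files saved locally in a folder and write the information to a text file. Most HTML toolkits in Python seem to be aimed at online web pages rather than locally saved files. For example, if I want to find the "CAS Registry Number" in every file and write those numbers to a text file, how should I do it?

Example of an HTML line containing the information: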

<DIV class=detailTitle><SPAN class=title>CAS Registry Number</SPAN> 555-34-0</DIV>

Answer 1

I recommend PyQuery, which handles HTML elements very elegantly.

A tutorial is here

The code is:

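from pyquery import PyQuery

# Read the locally saved HTML file
with open("index.html", "r") as f:
    html = f.read()

# The class name is PyQuery (capitalized), not pyquery
query = PyQuery(html)

# Text of the second <li> element
print(query("li").eq(1).text())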
......

Answer 2

The simplest way is to use BeautifulSoup.

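# BeautifulSoup 4 is imported from bs4 (the older BeautifulSoup 3 used
# "from BeautifulSoup import BeautifulSoup")
from bs4 import BeautifulSoup

# Read the locally saved HTML file
with open('file.html') as f:
    a = f.read()

bs = BeautifulSoup(a, 'html.parser')
# process the file as in normal cases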
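
To cover the whole task in the question (every file in a folder, results written to a text file), something along these lines might work. It is only a sketch: the function names `extract_cas` and `collect_cas` are made up for illustration, and it assumes each page marks the label with `<SPAN class=title>` inside the same `div` as the number, as in the sample line.

```python
import re
from pathlib import Path

from bs4 import BeautifulSoup

# Label text taken from the sample line in the question.
LABEL = "CAS Registry Number"


def extract_cas(html):
    """Return the CAS number that follows the label span, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for span in soup.find_all("span", class_="title"):
        if span.get_text(strip=True) == LABEL:
            # The number is part of the text of the enclosing <div>.
            text = span.parent.get_text(" ", strip=True)
            # Assumed CAS format: 2-7 digits, 2 digits, 1 check digit.
            match = re.search(r"\d{2,7}-\d{2}-\d", text)
            if match:
                return match.group(0)
    return None


def collect_cas(folder, out_file):
    """Scan every .html file in folder and write 'filename<TAB>CAS' lines."""
    with open(out_file, "w", encoding="utf-8") as out:
        for path in sorted(Path(folder).glob("*.html")):
            html = path.read_text(encoding="utf-8", errors="replace")
            cas = extract_cas(html)
            if cas is not None:
                out.write(f"{path.name}\t{cas}\n")
```

Calling `collect_cas("saved_pages", "cas_numbers.txt")` (both names are placeholders) would then write one line per file in which a number was found. Files without a match are skipped silently here; logging them instead may be preferable.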