Scraping data from locally stored HTML content - using Python

Time: 2016-10-13 16:56:44

Tags: java python html web-scraping

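I have a large Excel file in which every cell holds HTML content: comments posted by users of a database. The content of each cell is unique and varies in length. I need to strip out all of the HTML syntax/tags so that I can upload the plain text to a database table. How can I extract this data with Python (or with Java, if there is no Python answer)? Could you provide a code example?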

1 answer:

Answer 0: (score: 0)

Run pip install bs4 in a terminal. You can then extract the text in Python like this:

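import bs4

# Two sample cell values; in practice these would be read from the spreadsheet.
for cell in [
    '<html>The indicator lights on the control cabinet&nbsp;are to be replaced with 24Vdc&nbsp;LED\'s. 3 Red &amp;&nbsp;3 Green.</html>',
    '<html><div> <span style=""FONT-SIZE: 18pt"">Close the Monthly LAD and Lanyard Work orders to show they were executed. </span></div>']:
    # Naming the parser explicitly avoids bs4's "no parser was explicitly specified" warning.
    print(bs4.BeautifulSoup(cell, 'html.parser').text.strip())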

Result:

The indicator lights on the control cabinet are to be replaced with 24Vdc LED's. 3 Red & 3 Green.
Close the Monthly LAD and Lanyard Work orders to show they were executed.
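
If installing bs4 is not an option, roughly the same tag stripping can be sketched with only the standard library's html.parser module. This is a swapped-in technique, not the approach shown above, and the names TextExtractor and strip_html are hypothetical helpers made up for this illustration. It also collapses whitespace, so the non-breaking spaces produced by &nbsp; become ordinary spaces.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; and &nbsp;
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text found between tags
        self.parts.append(data)


def strip_html(cell):
    """Return the plain text of one cell, with runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(cell)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.parts)
    # str.split() with no arguments also splits on non-breaking spaces (\xa0)
    return " ".join(text.split())
```

Calling strip_html on each cell value before inserting it into the database table would mirror the loop in the answer. How the cells are read out of the Excel file is left open here, since the answer does not cover it.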