Question

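I have a large Excel file in which every cell holds HTML content containing comments posted by database users. The content of each cell is unique and varies in length. I need to get rid of all the HTML syntax/tags so that I can upload this content to a database table. How can I scrape this data using Python (or Java, if there is no Python answer)? Can you provide a code example?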

Answer 1

Run pip install bs4 in a terminal. Then you can extract the text in Python like this:

import bs4

for cell in [
    '<html>The indicator lights on the control cabinet&nbsp;are to be replaced with 24Vdc&nbsp;LED\'s. 3 Red &amp;&nbsp;3 Green.</html>',
    '<html><div> <span style=""FONT-SIZE: 18pt"">Close the Monthly LAD and Lanyard Work orders to show they were executed. </span></div>']:
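    # Name the parser explicitly; otherwise bs4 warns and picks whichever one is installed
    print(bs4.BeautifulSoup(cell, 'html.parser').get_text().strip())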

Result:

The indicator lights on the control cabinet are to be replaced with 24Vdc LED's. 3 Red & 3 Green.
Close the Monthly LAD and Lanyard Work orders to show they were executed.
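
To apply the same idea to every cell of the workbook, something like the following might work. It is only a sketch under a few assumptions: it swaps BeautifulSoup for the standard library's html.parser so the stripping part has no dependencies, it assumes the file is .xlsx and that openpyxl is installed, and the names strip_html and clean_workbook are made up for illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; and &nbsp;
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(value):
    """Return the text of an HTML fragment; pass non-strings through unchanged."""
    if not isinstance(value, str):
        return value  # numbers, dates and empty cells are left alone
    parser = _TextExtractor()
    parser.feed(value)
    parser.close()  # flush any text still buffered by the parser
    # split() treats the non-breaking space produced by &nbsp; as whitespace,
    # so this also turns it into a plain space and trims both ends
    return " ".join("".join(parser.parts).split())


def clean_workbook(src_path, dst_path):
    """Hypothetical helper: strip HTML from every text cell and save a copy."""
    import openpyxl  # third-party (pip install openpyxl), assumed to be available

    wb = openpyxl.load_workbook(src_path)
    for ws in wb.worksheets:
        for row in ws.iter_rows():
            for cell in row:
                if isinstance(cell.value, str):
                    cell.value = strip_html(cell.value)
    wb.save(dst_path)
```

Calling clean_workbook('comments.xlsx', 'comments_clean.xlsx') (both file names are placeholders) would write a cleaned copy and leave the original file untouched, which makes it easy to spot-check a few cells before loading the result into the database.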
