Question

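I have scraped a page's HTML and stored it on my local drive. I now need to extract the content, title, images, and so on using Python Newspaper (ver 0.1.2) and Python (ver 2.7.10). I could not find anything about this on the internet. How can I achieve this?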

Answer 1

You may have solved this already, but here is a way to parse a stored HTML file with Newspaper.

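from newspaper import Article

# The URL is left empty because the HTML comes from a local file.
article = Article('')
with open("cnn_article.html") as f:
    article.set_html(f.read())
article.parse()

title = article.title
authors = article.authors
text = article.text
keywords = article.meta_keywords
# A list holding the 'pubdate' meta value, or an empty list if the page has none.
published_date = sorted({value for (key, value) in
                         article.meta_data.items() if key == 'pubdate'})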
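
Since the question also asks about images, the same approach can be wrapped in a small helper. This is only a sketch: the names `read_html` and `parse_local_article` are hypothetical, the file is assumed to be UTF-8 encoded, and `article.top_image` is assumed to be available in the installed Newspaper version. Depending on configuration, Newspaper may still try to fetch images over the network while parsing.

```python
import io


def read_html(path, encoding="utf-8"):
    """Return the contents of a saved HTML file as unicode text."""
    # io.open behaves the same on Python 2.7 and Python 3.
    # The UTF-8 default is an assumption; pass another encoding if needed.
    with io.open(path, encoding=encoding, errors="replace") as f:
        return f.read()


def parse_local_article(path):
    """Parse a saved page with Newspaper and return a few fields as a dict."""
    # Imported here so read_html stays usable without Newspaper installed.
    from newspaper import Article

    article = Article('')  # empty URL: the HTML is supplied by hand
    article.set_html(read_html(path))
    article.parse()
    return {
        "title": article.title,
        "text": article.text,
        # Assumed attribute: the main image URL picked by Newspaper.
        "top_image": article.top_image,
    }
```

Calling `parse_local_article("cnn_article.html")` should then give the title, body text, and main image URL in one dictionary. Because the article has no URL, relative image paths in the page may not resolve to full URLs.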
