Question

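I want to scrape the exact publication time of news articles published on the web.

Some web pages have nicely formatted headers from which I can extract "last-modified" or "pubdate". The information in the header is messy, but usable. (By the way, metadata_parser helps a lot!)

But big news organizations such as BBC and CNN do not put date and time information in the HTML header, so I am trying to get the publication date and time from the HTML body instead.

For BBC, the date and time are embedded like this: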

<div data-timestamp-inserted="true" class="date date--v2" data-seconds="1447658338" data-datetime="16 November 2015">16 November 2015</div>

For CNN, it looks like this:

<p class="update-time">Updated 0137 GMT (0937 HKT) November 16, 2015 <span id="js-pagetop_video_source" class="video__source top_source">| Video Source: <a href="http://www.cnn.com/">CNN</a></span></p>

For nytimes:

<p class="byline-dateline"><span class="byline" itemprop="author creator" itemscope="" itemtype="http://schema.org/Person">By <span class="byline-author" data-byline-name="AURELIEN BREEDEN" itemprop="name">AURELIEN BREEDEN</span>, </span><span class="byline" itemprop="author creator" itemscope="" itemtype="http://schema.org/Person"><span class="byline-author" data-byline-name="KIMIKO DE FREYTAS-TAMURA" itemprop="name">KIMIKO DE FREYTAS-TAMURA</span> and </span><span class="byline" itemprop="author creator" itemscope="" itemtype="http://schema.org/Person" itemid="http://topics.nytimes.com/top/reference/timestopics/people/b/katrin_bennhold/index.html"><a href="http://topics.nytimes.com/top/reference/timestopics/people/b/katrin_bennhold/index.html" rel="author" title="More Articles by KATRIN BENNHOLD"><span class="byline-author" data-byline-name="KATRIN BENNHOLD" itemprop="name">KATRIN BENNHOLD</span></a></span><time class="dateline" datetime="2015-11-16" itemprop="datePublished" content="2015-11-16">NOV. 16, 2015</time></p>

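As you can see, almost every news organization has its own way of putting the date and time on the page.

My question is: is it possible to use some kind of fuzzy search in BeautifulSoup, or in a similar package, to extract the date and time information, so that I don't have to write rules for each website?

Thanks!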

Answer 1

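In my experience and humble opinion, the best way to scrape generic information is to use an NER (Named-Entity Recognition) system.

I recommend Scrapinghub's webstruct library:

Webstruct is a library for creating statistical NER systems that work on HTML data, i.e. a library for building tools that extract named entities (addresses, organization names, opening hours, etc.) from webpages.

Unlike most NER systems, webstruct works on HTML data, not only on text data. This makes it possible to define features that use the HTML structure, and also to embed annotation results back into HTML.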

GitHub repository: https://github.com/scrapinghub/webstruct

Documentation: http://webstruct.readthedocs.org/en/latest/

Update

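Since you need to scrape dates, you can also use Dateparser:

dateparser provides modules to easily parse localized dates in almost any string format commonly found on web pages.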

GitHub repository: https://github.com/scrapinghub/dateparser

Documentation: https://dateparser.readthedocs.org/en/latest/
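Before falling back on text parsing, it may be worth checking for machine-readable attributes: the BBC and NYT snippets in the question carry `data-seconds` and `datetime` attributes. The sketch below uses only the standard library; `DateAttributeScanner` and `find_dates` are hypothetical names, and treating `data-seconds` as Unix epoch seconds is an assumption based on the BBC example, not a documented convention.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser


class DateAttributeScanner(HTMLParser):
    """Collect candidate timestamps from machine-readable attributes."""

    def __init__(self):
        super().__init__()
        self.candidates = []  # list of (attribute name, datetime)

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if name == "data-seconds" and value.isdigit():
                # Assumption: the value is Unix epoch seconds (BBC snippet).
                moment = datetime.fromtimestamp(int(value), tz=timezone.utc)
                self.candidates.append((name, moment))
            elif name == "datetime":
                try:
                    moment = datetime.fromisoformat(value)
                except ValueError:
                    continue  # not ISO 8601; leave it to a text parser
                self.candidates.append((name, moment))


def find_dates(html):
    """Return every (attribute, datetime) candidate found in the markup."""
    scanner = DateAttributeScanner()
    scanner.feed(html)
    return scanner.candidates


bbc = ('<div data-timestamp-inserted="true" class="date date--v2" '
       'data-seconds="1447658338" data-datetime="16 November 2015">'
       '16 November 2015</div>')
nyt = ('<time class="dateline" datetime="2015-11-16" '
       'itemprop="datePublished" content="2015-11-16">NOV. 16, 2015</time>')

print(find_dates(bbc))
print(find_dates(nyt))
```

This covers only the two attribute names seen in the question. Pages without such attributes, like the CNN example, still need a text-based parser.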

Finding "dates" in generic web pages using Python
